ML Section: Predicting energy consumption at an hourly frequency for San Diego Gas & Electric (SDGE) utility region


Table of contents:

Introduction: Time Series Forecasting

1. Importing cleaned dataset

2. Defining Basic Functions

3. Simple Linear Regression models

$\quad$ 3.1. Statsmodels

$\quad$ 3.2. SkLearn Linear Regression

$\quad$$\quad$$\quad$ 3.2.1 Baseline Persistent Forecast

$\quad$ 3.3. Ridge Regression

$\quad$ 3.4. Feature engineering for hourly terms

$\quad$ 3.5. Elastic Net Regression

4. Random Forest Regression

5. Adding lags as X variables (Elastic net & RF)

6. Time Series Features and Models

$\quad$ 6.1. Basic Time series structure, Stationarity, Auto Correlation

$\quad$ 6.2. Handling Multiple seasonality

$\quad$ 6.3. SARIMAX

$\quad$ 6.4. FB Prophet

7. Regression models using Fourier terms

$\quad$ 7.1. Elastic net

$\quad$ 7.2. Random Forest

$\quad$ 7.3. XGBoost

$\quad$ 7.4. XGBoost + FB Prophet trend

8. Conclusion

9. Future work




Introduction Time Series Forecasting


Time series is a sequence of observations taken sequentially in time. Our aim here, as discussed in the milestone report is-

"The developed prediction model can be utilized by the electrical utilities to effectively plan their energy generation operations and balance the demand with appropriate supply. An efficient forecast can prove very useful for the utilities in planning their day to day operations, meeting their customers’ energy demand, and avoiding excess generation of energy."

Basically we want to fit a ML model onto our energy consumption time series and use that to predict the future energy consumption. Our time series dataset has dependent variable 'SDGE' which represents the energy consumption of the SDGE region in MWH and also has independent variables such as temperature in deg F, cumulative PV installation till date in kW and holidays and since it's a time series I have added some additional time dependent features such as hour of the day, day of the week, day of the month, season, etc. We can use all these features to train our model and use it predict the future energy values.

Any time series data has the following components: ref link

  • Level: The baseline value for the series if it were a straight line.
  • Trend: The optional and often linear increasing or decreasing behavior of the series over time.
  • Seasonality: The optional repeating patterns or cycles of behavior over time.
  • Noise: The optional variability in the observations that cannot be explained by the model.

Another important feature of most time series is that observations close together in time tend to be correlated (serially dependent). This feature is also called as auto-correlation and forms an important aspect of the conventional time series modeling techniques like ARIMA.

Modeling a time series involves a slightly different approach than regular ML problems. Here is a good link explaining the main differences.


In this notebook I have used the basic time series models like ARIMA and FB-prophet and then have extended my approach to include linear regression, random forests and XGBoost to see whether or not these linear and non-linear approaches can model our time series accurately.

At the end I have listed all the models with their error metrics based on which we can conclude which model/s perform the best. In any time series problem it is very important to define the window of prediction beforehand. Here I have tested the models on a future window of 1-hour, 1-week and 8 months. 1-hour ahead and 1-week ahead forecasting is easier and can be done using the lagged values of the energy consumption to increase the accuracy of the forecasts. Simple models like SARIMAX, Elastic net and Random Forest regression give around 98% R2 score and 1~2% MAPE using lag variables for short term 1-hour ahead and 1-week ahead forecasts. But if we need to forecast over a long term window (more than 1 month), we will see that FB Prophet and XGBoost (with Fourier terms for seasonalities) perform very well for longer forecast windows.

I have used the following forecasting metrics to measure the prediction performance of the models used: reference

R squared: coefficient of determination (this can be interpreted as the percentage of variance explained by the model), $(-\infty, 1]$
$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
sklearn.metrics.r2_score

Mean Absolute Error: this has the same unit of measurment as the initial series, $[0, +\infty)$
$MAE = \frac{\sum\limits_{i=1}^{n} |y_i - \hat{y}_i|}{n}$
sklearn.metrics.mean_absolute_error

Root Mean Squared Error: the most commonly used metric that gives a higher penalty to large errors and vice versa, this too has the same unit of measurment as the initial series $[0, +\infty)$
$RMSE = \sqrt(\frac{1}{n}\sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2)$
np.sqrt(sklearn.metrics.mean_squared_error)

Mean Absolute Percentage Error: this is the same as MAE but is computed as a percentage, which is very convenient when you want to explain the quality of the model to management, $[0, +\infty)$
$MAPE = \frac{100}{n}\sum\limits_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i}$

def mean_absolute_percentage_error(y_true, y_pred):
$\quad$ return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

Out of the 4 metrics, MAPE will be chosen to sort the models. MAPE helps us to understand the % error on the absolute value that can be expected from a model. For example, if the MAPE is 1% and the observed energy value is 4000 MW, we can say that our model on average can preidct accurately within a $+or- 40MW$ error band.

The error metrics for all the models will be comapred to a baseline of persistent forecast wherein simply the last year values are used for the forecast (when forecasting over longer window of >1 month).

Time series cross validation:
Cross-validation for time series is a bit different because time series have this temporal structure and one cannot randomly mix values in a fold while preserving this structure. With randomization, all time dependencies between observations will be lost. This is why we will have to use a more tricky approach in optimizing the model parameters such as "cross-validation on a rolling basis".

The idea is rather simple -- we train our model on a small segment of the time series from the beginning until some $t$, make predictions for the next $t+n$ steps, and calculate an error. Then, we expand our training sample to $t+n$ value, make predictions from $t+n$ until $t+2*n$, and continue moving our test segment of the time series until we hit the last available observation. As a result, we have as many folds as $n$ will fit between the initial training sample and the last observation. This can be established using the sklearn.model_selection's TimeSeriesSplit module. Reference image.png


1. Importing cleaned dataset

Importing the required libraries and modules

In [1]:
# Import basic modules
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn

import seaborn as sns
from matplotlib import rcParams

# Import regression and error metrics modules
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Import plotly modules to view time series in a more interactive way
import plotly.graph_objects as go
import pandas as pd

# Standard scaler for preprocessing
from sklearn.preprocessing import StandardScaler

# Importing time series split for cross validation 
from sklearn.model_selection import TimeSeriesSplit

plt.style.use('bmh')

# special IPython command to prepare the notebook for matplotlib and other libraries
%matplotlib inline 

#sns.set_style("whitegrid")
#sns.set_context("poster")
In [2]:
# Import the 5 years of hourly energy consumption data previously cleaned, explored, feature-engineered and... 
# ..stored as hourly1418_energy_temp_PV.csv

sdge = pd.read_csv('hourly1418_energy_temp_PV.csv', index_col = 'Dates', parse_dates=['Dates', 'Date'])
In [3]:
sdge.head()
Out[3]:
SDGE Date year month day hour weekday season holiday non_working STATION DailyCoolingDegreeDays DailyHeatingDegreeDays HourlyDryBulbTemperature AC_kW cum_AC_kW
Dates
2014-01-01 00:00:00 2096.0 2014-01-01 2014 1 1 0 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.0 0.0 220992.227
2014-01-01 01:00:00 1986.0 2014-01-01 2014 1 1 1 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.5 0.0 220992.227
2014-01-01 02:00:00 1936.0 2014-01-01 2014 1 1 2 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.8 0.0 220992.227
2014-01-01 03:00:00 1896.0 2014-01-01 2014 1 1 3 Wednesday winter 1 non-working 72290023188 0.0 7.0 50.0 0.0 220992.227
2014-01-01 04:00:00 1899.0 2014-01-01 2014 1 1 4 Wednesday winter 1 non-working 72290023188 0.0 7.0 48.8 0.0 220992.227

2. Defining Basic Functions


Defining basic functions like calculating error metrics, plotting predicted vs original time series, etc. so that we can use the same functions seamlessly across different models and this will also allow us to compare different models using the same metrics.


In [4]:
# Creating an empty dict to save all the erros from different models
dict_error = dict()
In [5]:
# creating function for plotting predicted vs actual energy values
def plot_predvstrue_reg(pred, truth, model_name=None):
    """
    Plots the observed energy consumption against the predicted energy consumption
    """
    fig, ax = plt.subplots(1,1, figsize=(8,8))
    ax.scatter(truth, pred) 
    _ = plt.xlabel("Observed energy in MWH")
    _ = plt.ylabel("Predicted energy in MWH")
    _ = plt.title("Observed vs Predicted energy using model {}".format(model_name))
    _ = plt.xlim(1000, 5000)
    _ = plt.ylim(1000, 5000)
    #plotting 45 deg line to see how the prediction differs from the observed values
    x = np.linspace(*ax.get_xlim())
    _ = ax.plot(x, x)
In [6]:
def error_metrics(y_pred, y_truth, model_name = None, test = True):
    """
    Printing error metrics like RMSE (root mean square error), R2 score, 
    MAE (mean absolute error), MAPE (mean absolute % error). 
    
    y_pred: predicted values of y using the model model_name
    y_truth: observed values of y
    model_name: name of the model used for predictions
    test: if validating on test set, True; otherwise False for training set validation
    
    The function will print the RMSE, R2, MAE and MAPE error metrics for the model_name and also store the results along with 
    model_name in the dictionary dict_error so that we can compare all the models at the end.
    """
    if isinstance(y_pred, np.ndarray):
        y_pred = y_pred
    else:
        y_pred = y_pred.to_numpy()
        
    if isinstance(y_truth, np.ndarray):
        y_truth = y_truth
    else:
        y_truth = y_truth.to_numpy()
        
    print('\nError metrics for model {}'.format(model_name))
    
    RMSE = np.sqrt(mean_squared_error(y_truth, y_pred))
    print("RMSE or Root mean squared error: %.2f" % RMSE)
    
    # Explained variance score: 1 is perfect prediction

    R2 = r2_score(y_truth, y_pred)
    print('Variance score: %.2f' % R2 )

    MAE = mean_absolute_error(y_truth, y_pred)
    print('Mean Absolute Error: %.2f' % MAE)

    MAPE = (np.mean(np.abs((y_truth - y_pred) / y_truth)) * 100)
    print('Mean Absolute Percentage Error: %.2f %%' % MAPE)
    
    # Appending the error values along with the model_name to the dict
    if test:
        train_test = 'test'
    else:
        train_test = 'train'
    
    #df = pd.DataFrame({'model': model_name, 'RMSE':RMSE, 'R2':R2, 'MAE':MAE, 'MAPE':MAPE}, index=[0])
    name_error = ['model', 'train_test', 'RMSE', 'R2', 'MAE', 'MAPE']
    value_error = [model_name, train_test, RMSE, R2, MAE, MAPE]
    list_error = list(zip(name_error, value_error))
    
    for error in list_error:
        if error[0] in dict_error:
            # append the new number to the existing array at this slot
            dict_error[error[0]].append(error[1])
        else:
            # create a new array in this slot
            dict_error[error[0]] = [error[1]]
    #return(dict_error)
In [7]:
def plot_timeseries(ts, title = 'og', opacity = 1):
    """
    Plot plotly time series of any given timeseries ts
    """
    fig = go.Figure()

    fig.add_trace(go.Scatter(x = ts.index, y = ts.values, name = "observed",
                         line_color = 'lightslategrey', opacity = opacity))

    fig.update_layout(title_text = title,
                  xaxis_rangeslider_visible = True)
    fig.show()
In [8]:
def plot_ts_pred(og_ts, pred_ts, model_name=None, og_ts_opacity = 0.5, pred_ts_opacity = 0.5):
    """
    Plot plotly time series of the original (og_ts) and predicted (pred_ts) time series values to check how our model performs.
    model_name: name of the model used for predictions
    og_ts_opacity: opacity of the original time series
    pred_ts_opacity: opacity of the predicted time series
    """
    
    fig = go.Figure()

    fig.add_trace(go.Scatter(x = og_ts.index, y = np.array(og_ts.values), name = "Observed",
                         line_color = 'deepskyblue', opacity = og_ts_opacity))

    try:
        fig.add_trace(go.Scatter(x = pred_ts.index, y = pred_ts, name = model_name,
                         line_color = 'lightslategrey', opacity = pred_ts_opacity))
    except: #if predicted values are a numpy array they won't have an index
        fig.add_trace(go.Scatter(x = og_ts.index, y = pred_ts, name = model_name,
                         line_color = 'lightslategrey', opacity = pred_ts_opacity))


    #fig.add_trace(go)
    fig.update_layout(title_text = 'Observed test set vs predicted energy MWH values using {}'.format(model_name),
                  xaxis_rangeslider_visible = True)
    fig.show()
In [9]:
def train_test(data, test_size = 0.15, scale = False, cols_to_transform=None, include_test_scale=False):
    """
    
        Perform train-test split with respect to time series structure
        
        - df: dataframe with variables X_n to train on and the dependent output y which is the column 'SDGE' in this notebook
        - test_size: size of test set
        - scale: if True, then the columns in the -'cols_to_transform'- list will be scaled using StandardScaler
        - include_test_scale: If True, the StandardScaler fits the data on the training as well as the test set; if False, then
          the StandardScaler fits only on the training set.
        
    """
    df = data.copy()
    # get the index after which test set starts
    test_index = int(len(df)*(1-test_size))
    
    # StandardScaler fit on the entire dataset
    if scale and include_test_scale:
        scaler = StandardScaler()
        df[cols_to_transform] = scaler.fit_transform(df[cols_to_transform])
        
    X_train = df.drop('SDGE', axis = 1).iloc[:test_index]
    y_train = df.SDGE.iloc[:test_index]
    X_test = df.drop('SDGE', axis = 1).iloc[test_index:]
    y_test = df.SDGE.iloc[test_index:]
    
    # StandardScaler fit only on the training set
    if scale and not include_test_scale:
        scaler = StandardScaler()
        X_train[cols_to_transform] = scaler.fit_transform(X_train[cols_to_transform])
        X_test[cols_to_transform] = scaler.transform(X_test[cols_to_transform])
    
    return X_train, X_test, y_train, y_test

3. Simple Linear Regression models


Starting off with simple linear regression models to see how they perform on our time series dataset.

In [10]:
# creating categorical columns for linear regression 
cat_cols = ['year', 'month', 'day', 'hour', 'weekday', 'season', 'holiday', 'non_working']

for col in cat_cols:
    sdge[col] = sdge[col].astype('category')
In [11]:
sdge.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 43824 entries, 2014-01-01 00:00:00 to 2018-12-31 23:00:00
Data columns (total 16 columns):
SDGE                        43824 non-null float64
Date                        43824 non-null datetime64[ns]
year                        43824 non-null category
month                       43824 non-null category
day                         43824 non-null category
hour                        43824 non-null category
weekday                     43824 non-null category
season                      43824 non-null category
holiday                     43824 non-null category
non_working                 43824 non-null category
STATION                     43824 non-null int64
DailyCoolingDegreeDays      43824 non-null float64
DailyHeatingDegreeDays      43824 non-null float64
HourlyDryBulbTemperature    43824 non-null float64
AC_kW                       43824 non-null float64
cum_AC_kW                   43824 non-null float64
dtypes: category(8), datetime64[ns](1), float64(6), int64(1)
memory usage: 3.3 MB
In [12]:
# Preparing dummy columns for use in sklearn's linear regression 
sdge_lin = pd.get_dummies(sdge, drop_first = True)
In [13]:
sdge_lin.head(3)
Out[13]:
SDGE Date STATION DailyCoolingDegreeDays DailyHeatingDegreeDays HourlyDryBulbTemperature AC_kW cum_AC_kW year_2015 year_2016 ... hour_23 weekday_Monday weekday_Saturday weekday_Sunday weekday_Thursday weekday_Tuesday weekday_Wednesday season_winter holiday_1 non_working_working
Dates
2014-01-01 00:00:00 2096.0 2014-01-01 72290023188 0.0 7.0 51.0 0.0 220992.227 0 0 ... 0 0 0 0 0 0 1 1 1 0
2014-01-01 01:00:00 1986.0 2014-01-01 72290023188 0.0 7.0 51.5 0.0 220992.227 0 0 ... 0 0 0 0 0 0 1 1 1 0
2014-01-01 02:00:00 1936.0 2014-01-01 72290023188 0.0 7.0 51.8 0.0 220992.227 0 0 ... 0 0 0 0 0 0 1 1 1 0

3 rows × 85 columns

In [14]:
sdge_lin.columns
Out[14]:
Index(['SDGE', 'Date', 'STATION', 'DailyCoolingDegreeDays',
       'DailyHeatingDegreeDays', 'HourlyDryBulbTemperature', 'AC_kW',
       'cum_AC_kW', 'year_2015', 'year_2016', 'year_2017', 'year_2018',
       'month_2', 'month_3', 'month_4', 'month_5', 'month_6', 'month_7',
       'month_8', 'month_9', 'month_10', 'month_11', 'month_12', 'day_2',
       'day_3', 'day_4', 'day_5', 'day_6', 'day_7', 'day_8', 'day_9', 'day_10',
       'day_11', 'day_12', 'day_13', 'day_14', 'day_15', 'day_16', 'day_17',
       'day_18', 'day_19', 'day_20', 'day_21', 'day_22', 'day_23', 'day_24',
       'day_25', 'day_26', 'day_27', 'day_28', 'day_29', 'day_30', 'day_31',
       'hour_1', 'hour_2', 'hour_3', 'hour_4', 'hour_5', 'hour_6', 'hour_7',
       'hour_8', 'hour_9', 'hour_10', 'hour_11', 'hour_12', 'hour_13',
       'hour_14', 'hour_15', 'hour_16', 'hour_17', 'hour_18', 'hour_19',
       'hour_20', 'hour_21', 'hour_22', 'hour_23', 'weekday_Monday',
       'weekday_Saturday', 'weekday_Sunday', 'weekday_Thursday',
       'weekday_Tuesday', 'weekday_Wednesday', 'season_winter', 'holiday_1',
       'non_working_working'],
      dtype='object')

3.1. Statsmodels

Fitting linear regression using statsmodel

There are multiple external regressor terms, in addition to the time related terms, in the dataset like 'DailyCoolingDegreeDays', 'DailyHeatingDegreeDays', 'HourlyDryBulbTemperature', 'AC_kW', 'cum_AC_kW'. In this project we'll be focusing only on the 'HourlyDryBulbTemperature' and 'cum_AC_kW' columns because 'AC_kW' data has been captured cumulatively in the 'cum_AC_kW' column and the 'DailyCoolingDegreeDays' and 'DailyHeatingDegreeDays' columns are dependent on the daily temperature, so to avoid any correlation between the X variables we will use only the temperature data.

In [15]:
# Checking linear regression fit using statsmodels Linear regression

m = ols('SDGE ~  C(year) + C(month) + C(hour) + C(season) + C(non_working) + \
                 HourlyDryBulbTemperature + cum_AC_kW', sdge).fit()
print(m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   SDGE   R-squared:                       0.766
Model:                            OLS   Adj. R-squared:                  0.766
Method:                 Least Squares   F-statistic:                     3495.
Date:                Mon, 09 Sep 2019   Prob (F-statistic):               0.00
Time:                        13:08:28   Log-Likelihood:            -3.0102e+05
No. Observations:               43824   AIC:                         6.021e+05
Df Residuals:                   43782   BIC:                         6.025e+05
Df Model:                          41                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                   723.6973     27.211     26.596      0.000     670.363     777.031
C(year)[T.2015]             -27.9561     12.427     -2.250      0.024     -52.313      -3.599
C(year)[T.2016]            -103.2693     30.912     -3.341      0.001    -163.857     -42.682
C(year)[T.2017]            -128.4263     45.900     -2.798      0.005    -218.391     -38.461
C(year)[T.2018]            -150.1284     61.733     -2.432      0.015    -271.126     -29.131
C(month)[T.2]               -87.5614      5.654    -15.486      0.000     -98.644     -76.479
C(month)[T.3]              -174.4948      5.920    -29.475      0.000    -186.098    -162.891
C(month)[T.4]              -232.2749      6.582    -35.289      0.000    -245.176    -219.374
C(month)[T.5]              -186.1033      7.236    -25.718      0.000    -200.287    -171.920
C(month)[T.6]               -38.1530      4.938     -7.727      0.000     -47.831     -28.475
C(month)[T.7]               223.8704      6.164     36.321      0.000     211.790     235.951
C(month)[T.8]               300.7221      7.144     42.093      0.000     286.719     314.725
C(month)[T.9]               203.1271      8.016     25.341      0.000     187.416     218.838
C(month)[T.10]               -0.8837      8.784     -0.101      0.920     -18.100      16.333
C(month)[T.11]             -133.3559     13.330    -10.004      0.000    -159.483    -107.228
C(month)[T.12]               26.5865     14.680      1.811      0.070      -2.186      55.359
C(hour)[T.1]                -87.4917      7.707    -11.353      0.000    -102.597     -72.387
C(hour)[T.2]               -135.2684      7.709    -17.547      0.000    -150.378    -120.159
C(hour)[T.3]               -153.4311      7.712    -19.894      0.000    -168.548    -138.315
C(hour)[T.4]               -125.0735      7.714    -16.214      0.000    -140.193    -109.954
C(hour)[T.5]                -15.7209      7.715     -2.038      0.042     -30.843      -0.599
C(hour)[T.6]                140.3607      7.709     18.209      0.000     125.252     155.470
C(hour)[T.7]                211.5632      7.710     27.439      0.000     196.451     226.676
C(hour)[T.8]                236.7922      7.757     30.526      0.000     221.588     251.996
C(hour)[T.9]                240.1033      7.844     30.609      0.000     224.728     255.478
C(hour)[T.10]               258.8114      7.889     32.805      0.000     243.348     274.275
C(hour)[T.11]               251.3403      7.988     31.464      0.000     235.683     266.997
C(hour)[T.12]               273.9440      8.010     34.202      0.000     258.245     289.643
C(hour)[T.13]               318.8534      8.003     39.842      0.000     303.167     334.539
C(hour)[T.14]               377.3041      7.975     47.313      0.000     361.673     392.935
C(hour)[T.15]               454.2681      7.927     57.309      0.000     438.732     469.804
C(hour)[T.16]               555.6116      7.887     70.449      0.000     540.153     571.070
C(hour)[T.17]               718.5000      7.793     92.198      0.000     703.226     733.774
C(hour)[T.18]               770.3008      7.756     99.322      0.000     755.100     785.502
C(hour)[T.19]               786.2214      7.738    101.609      0.000     771.055     801.388
C(hour)[T.20]               758.7183      7.726     98.208      0.000     743.576     773.861
C(hour)[T.21]               603.2494      7.717     78.175      0.000     588.125     618.374
C(hour)[T.22]               365.4401      7.713     47.380      0.000     350.323     380.558
C(hour)[T.23]               147.0761      7.707     19.084      0.000     131.971     162.181
C(season)[T.winter]          35.0143      4.825      7.257      0.000      25.558      44.471
C(non_working)[T.working]   199.1226      2.402     82.888      0.000     194.414     203.831
HourlyDryBulbTemperature     19.2482      0.264     72.967      0.000      18.731      19.765
cum_AC_kW                  2.047e-05   9.29e-05      0.220      0.826      -0.000       0.000
==============================================================================
Omnibus:                     8542.099   Durbin-Watson:                   0.111
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            25224.607
Skew:                           1.021   Prob(JB):                         0.00
Kurtosis:                       6.106   Cond. No.                     2.08e+19
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.13e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Plotting the observed vs predicted values for the above statsmodel linear regression model

In [16]:
plot_predvstrue_reg(m.fittedvalues, sdge.SDGE, model_name = 'Statsmodel Simple Linear Regression')
  • The R2 for such a simple model is surprisingly good (0.766). And we see that some of the most important variables are the hour values, season and we also see a slightly decreasing trend with years. i.e. the energy consumption slightly decreases from 2014 to 2018.
  • The model predicts very well till ~3000MWH but then it fails to predict the higher energy consumption values.
  • But we need to be careful before interpreting this model because the X variables weren't scaled and the model was fit on the entire dataset. We will solve these issues now in the upcoming models.

3.2. SkLearn Linear Regression

Fitting linear regression using sklearn

In [17]:
# Keeping only the necessary columns; day columns were removed because there is 0 to none monthly seasonality in the data    
sdge_lin.drop(['Date','STATION', 'AC_kW', 'DailyCoolingDegreeDays','DailyHeatingDegreeDays',           
              'weekday_Monday', 'weekday_Saturday', 'weekday_Sunday', 'weekday_Thursday',
              'weekday_Tuesday', 'weekday_Wednesday', 'holiday_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6',
               'day_7', 'day_8', 'day_9', 'day_10', 'day_11', 'day_12', 'day_13',
               'day_14', 'day_15', 'day_16', 'day_17', 'day_18', 'day_19', 'day_20',
               'day_21', 'day_22', 'day_23', 'day_24', 'day_25', 'day_26', 'day_27',
               'day_28', 'day_29', 'day_30', 'day_31'], axis = 1, inplace = True)
In [18]:
sdge_lin.columns
Out[18]:
Index(['SDGE', 'HourlyDryBulbTemperature', 'cum_AC_kW', 'year_2015',
       'year_2016', 'year_2017', 'year_2018', 'month_2', 'month_3', 'month_4',
       'month_5', 'month_6', 'month_7', 'month_8', 'month_9', 'month_10',
       'month_11', 'month_12', 'hour_1', 'hour_2', 'hour_3', 'hour_4',
       'hour_5', 'hour_6', 'hour_7', 'hour_8', 'hour_9', 'hour_10', 'hour_11',
       'hour_12', 'hour_13', 'hour_14', 'hour_15', 'hour_16', 'hour_17',
       'hour_18', 'hour_19', 'hour_20', 'hour_21', 'hour_22', 'hour_23',
       'season_winter', 'non_working_working'],
      dtype='object')
In [19]:
# Creating the train and test data
# A test_size og 0.15 gives us a training set of 4 years and 3 months and a test set of 9 months (the test set includes 
#...winter and summer seasons.

cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW'] # other columns are binary values
X_train, X_test, y_train, y_test = train_test(sdge_lin, test_size = 0.15, scale = True, cols_to_transform=cols_to_transform)

# This creates a LinearRegression object
lm = LinearRegression()
lm
Out[19]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [20]:
# Fitting the linear regression model
lm.fit(X_train, y_train)
Out[20]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [21]:
# Plotting the coefficients to check the importance of each coefficient 

# Plot the coefficients
_ = plt.figure(figsize = (16, 7))
_ = plt.plot(range(len(X_train.columns)), lm.coef_)
_ = plt.xticks(range(len(X_train.columns)), X_train.columns.values, rotation = 90)
_ = plt.margins(0.02)
_ = plt.axhline(0, linewidth = 0.5, color = 'r')
_ = plt.title('sklearn Simple linear regression coefficients')
_ = plt.ylabel('lm_coeff')
_ = plt.xlabel('Features')

  • The coefficent plot basically says that the energy values increase with temperature, are higher in summer months than winter months, are higher for working days than non-working days and decrease over the years from 2014-2018.
  • The hour variable is the most significant coeffecient where the energy increases significantly over the evening hours.
In [22]:
#X_train.describe().T[['min', 'max']]
In [23]:
# Plotting lm.predict(X) vs observed energy consumption

plot_predvstrue_reg(lm.predict(X_test), y_test, model_name = 'sklearn Simple linear regression')
  • We can see the model is under predicting the higher energy values.
In [24]:
error_metrics(lm.predict(X_train), y_train, model_name = 'Simple linear regression with scaling', test = False)
Error metrics for model Simple linear regression with scaling
RMSE or Root mean squared error: 230.15
Variance score: 0.77
Mean Absolute Error: 170.18
Mean Absolute Percentage Error: 7.13 %
In [25]:
# on test set
error_metrics(lm.predict(X_test), y_test, model_name = 'Simple linear regression with scaling', test = True)
Error metrics for model Simple linear regression with scaling
RMSE or Root mean squared error: 254.84
Variance score: 0.73
Mean Absolute Error: 194.84
Mean Absolute Percentage Error: 8.37 %
In [26]:
# Plotting the predicted values with the original time series (test set)
plot_ts_pred(y_test, lm.predict(X_test), model_name='Simple linear regression with scaling', 
             og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
  • It can be seen that the model predicts the daily trend and seasonality pretty well but the high peaks are not captured by the model very well. It can be said the daily, weekly and yearly seasonalities were captured decently well by the model.
  • But from the error metrics on the train and test sets it can be observed that the model is stable and isn't overfitting.

3.2.1 Baseline Persistent Forecast

I had mentioned in the introduction that all the models will be compared to a baseline persistent model which is simply a repetition of the last year energy consumption values.

In [27]:
# Calculating the errors for persistent forecast (where we simply repeat the values from last year)
#y_test.index.year
_ = error_metrics(sdge_lin.loc[X_test.index.shift(-8760, freq='H'), 'SDGE'], 
                  y_test, model_name='Baseline Persistent forecast, repeat last year', test=True)
Error metrics for model Baseline Persistent forecast, repeat last year
RMSE or Root mean squared error: 330.74
Variance score: 0.55
Mean Absolute Error: 224.89
Mean Absolute Percentage Error: 9.23 %
In [28]:
plot_ts_pred(y_test, sdge_lin.loc[X_test.index.shift(-8760, freq='H'), 'SDGE'].to_numpy(),
             model_name='Baseline Persistent forecast, repeat last year')
In [29]:
# In the energy forecasting domain, it is even more important to get the daily max demand right. So, calculating the ...
#...error on that too

# Resampling both the y_test and predictions at a 24 hours period and using the max as the aggregate function
_ = error_metrics(sdge_lin.loc[X_test.index.shift(-8760, freq='H'), 'SDGE'].resample('24h').max(), 
                  y_test.resample('24h').max(), 
                  model_name='Baseline Persistent forecast, repeat last year; daily MAX', test=True)
Error metrics for model Baseline Persistent forecast, repeat last year; daily MAX
RMSE or Root mean squared error: 389.37
Variance score: 0.22
Mean Absolute Error: 264.44
Mean Absolute Percentage Error: 8.60 %

3.3. Ridge Regression

Fitting Ridge regression using sklearn (L2 regularization)

  • Trying linear regression with regularization using the L2 norm (Ridge regression)
In [30]:
# Trying regularization: Ridge regression
from sklearn.linear_model import Ridge

# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha = 0.2, normalize = True) #selecting alpha=0.2 here. 
# Tried different values from 0.1 to 0.8, results don't change much

# Fit the regressor to the data
ridge.fit(X_train, y_train)

# Compute and print the coefficients
ridge_coef = ridge.coef_
#print(ridge_coef)

# Plot the coefficients
_ = plt.figure(figsize = (16, 7))
_ = plt.plot(range(len(X_train.columns)), ridge_coef)
_ = plt.xticks(range(len(X_train.columns)), X_train.columns.values, rotation = 90)
_ = plt.margins(0.02)
_ = plt.axhline(0, linewidth = 0.5, color = 'r')
_ = plt.title('significane of features as per Ridge regularization')
_ = plt.ylabel('Ridge coeff')
_ = plt.xlabel('Features')
################################################################################
  • The significance of features remains more or less same as observed for the simple linear regression model.
In [31]:
# PLotting the residuals
residuals = (y_test - ridge.predict(X_test))
_ = plt.figure(figsize=(7,7))
_ = plt.scatter(ridge.predict(X_test) , residuals, alpha = 0.5) 
_ = plt.xlabel("Model predicted energy values")
_ = plt.ylabel("Residuals")
_ = plt.title("Fitted values versus Residuals for Ridge regression")
In [32]:
print('Ridge regression on training set')
error_metrics(ridge.predict(X_train), y_train, model_name = 'Ridge regression with scaling', test = False)

print('\nRidge regression on test set')
error_metrics(ridge.predict(X_test), y_test, model_name = 'Ridge regression with scaling', test = True)
Ridge regression on training set

Error metrics for model Ridge regression with scaling
RMSE or Root mean squared error: 242.50
Variance score: 0.74
Mean Absolute Error: 176.28
Mean Absolute Percentage Error: 7.34 %

Ridge regression on test set

Error metrics for model Ridge regression with scaling
RMSE or Root mean squared error: 262.81
Variance score: 0.72
Mean Absolute Error: 194.25
Mean Absolute Percentage Error: 8.18 %
In [33]:
# Plotting the observed test energy and predicted energy data on the same graph as line plots

plot_ts_pred(y_test, ridge.predict(X_test), model_name='Ridge regression', og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
  • Thus, we can see that the ridge regression and simple regression without any regularization perform decently well on the test set and the models don't overfit capturing the overall trend and seasonality pretty well.
  • The residuals as expected are not normally distributed and are biased towards the ends because the model doesn't predict the extreme values very well, especially the high energy consumption in the summer.

3.4. Reducing the feature space

Trying different columns this time

  • In the above models we had used hot encoded variables like hour_1, hour_2,..., month_2, month_3,.. etc. This results in a loss of information because the model assumes that 23rd hour is far away from the 0th hour (and same for months, month_12 is far away from month_1) which is not the case because the time series is periodic and the 0th hour is as much closer to the 23rd hour as it is to the 1st hour. To avoid the loss of information and to lower the number of predictor variables we'll try some feature engineering on the X space.

  • We'll also emit the month and weekday columns because the season column captures the months feature anyways. So the number of X variables used in the models below will be considerably lower than the above models.

In [34]:
# Creating a new dataframe similar to the first one
sdge1 = pd.read_csv('hourly1418_energy_temp_PV.csv', index_col = 'Dates', parse_dates=['Dates', 'Date'])
In [35]:
sdge1.head()
Out[35]:
SDGE Date year month day hour weekday season holiday non_working STATION DailyCoolingDegreeDays DailyHeatingDegreeDays HourlyDryBulbTemperature AC_kW cum_AC_kW
Dates
2014-01-01 00:00:00 2096.0 2014-01-01 2014 1 1 0 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.0 0.0 220992.227
2014-01-01 01:00:00 1986.0 2014-01-01 2014 1 1 1 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.5 0.0 220992.227
2014-01-01 02:00:00 1936.0 2014-01-01 2014 1 1 2 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.8 0.0 220992.227
2014-01-01 03:00:00 1896.0 2014-01-01 2014 1 1 3 Wednesday winter 1 non-working 72290023188 0.0 7.0 50.0 0.0 220992.227
2014-01-01 04:00:00 1899.0 2014-01-01 2014 1 1 4 Wednesday winter 1 non-working 72290023188 0.0 7.0 48.8 0.0 220992.227
In [36]:
# Dividing the hours into 4 groups-> night, morning, afternoon, evening

hour_dict = {'morning': list(np.arange(7,13)),'afternoon': list(np.arange(13,16)), 'evening': list(np.arange(16,22)),
            'night': [22, 23, 0, 1, 2, 3, 4, 5, 6]}
In [37]:
hour_dict
Out[37]:
{'morning': [7, 8, 9, 10, 11, 12],
 'afternoon': [13, 14, 15],
 'evening': [16, 17, 18, 19, 20, 21],
 'night': [22, 23, 0, 1, 2, 3, 4, 5, 6]}
In [38]:
def time_of_day(x):
    if x in hour_dict['morning']:
        return 'morning'
    elif x in hour_dict['afternoon']:
        return 'afternoon'
    elif x in hour_dict['evening']:
        return 'evening'
    else:
        return 'night'
In [39]:
sdge1['time_of_day'] = sdge1['hour'].apply(time_of_day)
In [40]:
sdge1.head()
Out[40]:
SDGE Date year month day hour weekday season holiday non_working STATION DailyCoolingDegreeDays DailyHeatingDegreeDays HourlyDryBulbTemperature AC_kW cum_AC_kW time_of_day
Dates
2014-01-01 00:00:00 2096.0 2014-01-01 2014 1 1 0 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.0 0.0 220992.227 night
2014-01-01 01:00:00 1986.0 2014-01-01 2014 1 1 1 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.5 0.0 220992.227 night
2014-01-01 02:00:00 1936.0 2014-01-01 2014 1 1 2 Wednesday winter 1 non-working 72290023188 0.0 7.0 51.8 0.0 220992.227 night
2014-01-01 03:00:00 1896.0 2014-01-01 2014 1 1 3 Wednesday winter 1 non-working 72290023188 0.0 7.0 50.0 0.0 220992.227 night
2014-01-01 04:00:00 1899.0 2014-01-01 2014 1 1 4 Wednesday winter 1 non-working 72290023188 0.0 7.0 48.8 0.0 220992.227 night
In [41]:
# creating categorical columns for linear regression 
cat_cols1 = ['month', 'day', 'hour', 'weekday', 'season', 'holiday', 'non_working', 'time_of_day']
#not including year above to capture the decreasing energy trend over increasing value of years
for col in cat_cols1:
    sdge1[col] = sdge1[col].astype('category')
In [42]:
# Columns to use for regression
cols_use = ['SDGE', 'year', 'time_of_day', 'season', 'non_working', 'HourlyDryBulbTemperature', 'cum_AC_kW']

#['Date','STATION','DailyCoolingDegreeDays','DailyHeatingDegreeDays', 'AC_kW', \
#              'weekday_Monday', 'weekday_Saturday', 'weekday_Sunday', 'weekday_Thursday',\
 #             'weekday_Tuesday', 'weekday_Wednesday', 'holiday_1']
In [43]:
sdge1_lin = pd.get_dummies(sdge1[cols_use], drop_first = True)
In [44]:
sdge1_lin.head()
Out[44]:
SDGE year HourlyDryBulbTemperature cum_AC_kW time_of_day_evening time_of_day_morning time_of_day_night season_winter non_working_working
Dates
2014-01-01 00:00:00 2096.0 2014 51.0 220992.227 0 0 1 1 0
2014-01-01 01:00:00 1986.0 2014 51.5 220992.227 0 0 1 1 0
2014-01-01 02:00:00 1936.0 2014 51.8 220992.227 0 0 1 1 0
2014-01-01 03:00:00 1896.0 2014 50.0 220992.227 0 0 1 1 0
2014-01-01 04:00:00 1899.0 2014 48.8 220992.227 0 0 1 1 0
In [45]:
# Creating the train and test data
# Since we had seen in the above models that the energy consumption decreases over the years from 14-18, the year columns was
#..not categorized here and instead used as a regular reressor term

cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
X_train, X_test, y_train, y_test = train_test(sdge1_lin, test_size = 0.15, scale = True, cols_to_transform=cols_to_transform, 
                                              include_test_scale=False)

3.5. Elastic Net Regression

Trying Elastic net regression on the above reduced X space

In [46]:
# Trying elastic net regression

# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

def trend_model(data, cols_to_transform, l1_space, alpha_space, cols_use = 0, scale = True, test_size = 0.15, 
                include_test_scale=False):
    """
    Tuning, fitting and predicting with an Elastic net regression model.
    data: time series dataframe including X and y variables
    col_use: columns including the y variable to be used from the data
    cols_to_transform: columns to be scaled using StandardScaler if scale = True
    l1_space: potential values to try for the l1_ratio parameter of the elastic net regression
    include_test_scale: If True then the StandardScaler will be fit on the entire dataset instead of just the training set
    
    A note about l1_ratio: The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. 
    For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. 
    For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
    """
    
    # Creating the train test split
    if cols_use != 0:
        df = data[cols_use]
    else:
        df = data
    
    X_train, X_test, y_train, y_test = train_test(df, test_size = test_size, 
                                              scale = scale, cols_to_transform=cols_to_transform, 
                                              include_test_scale=include_test_scale)

    
    # Create the hyperparameter grid
    #l1_space = np.linspace(0, 1, 50)
    param_grid = {'l1_ratio': l1_space, 'alpha': alpha_space}

    # Instantiate the ElasticNet regressor: elastic_net
    elastic_net = ElasticNet()

    # for time-series cross-validation set 5 folds
    tscv = TimeSeriesSplit(n_splits=5)

    # Setup the GridSearchCV object: gm_cv ...trying 5 fold cross validation 
    gm_cv = GridSearchCV(elastic_net, param_grid, cv = tscv)

    # Fit it to the training data
    gm_cv.fit(X_train, y_train)

    # Predict on the test set and compute metrics
    y_pred = gm_cv.predict(X_test)
    r2 = gm_cv.score(X_test, y_test)
    mse = mean_squared_error(y_test, y_pred)
    print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
    print("Tuned ElasticNet R squared: {}".format(r2))
    print("Tuned ElasticNet RMSE: {}".format(np.sqrt(mse)))
    # fitting the elastic net again using the best model from above

    elastic_net_opt = ElasticNet(l1_ratio = gm_cv.best_params_['l1_ratio']) 
    elastic_net_opt.fit(X_train, y_train)
    
    # Plot the coefficients
    _ = plt.figure(figsize = (15, 7))
    _ = plt.plot(range(len(X_train.columns)), elastic_net_opt.coef_)
    _ = plt.xticks(range(len(X_train.columns)), X_train.columns.values, rotation = 90)
    _ = plt.margins(0.02)
    _ = plt.axhline(0, linewidth = 0.5, color = 'r')
    _ = plt.title('significane of features as per Elastic regularization')
    _ = plt.ylabel('Elastic net coeff')
    _ = plt.xlabel('Features')
    
    # Plotting y_true vs predicted
    _ = plt.figure(figsize = (5,5))
    _ = plot_predvstrue_reg(elastic_net_opt.predict(X_test), y_test, model_name = 'Elastic net optimal linear regression')
    
    # returns the train and test X and y sets and also the optimal model
    return X_train, X_test, y_train, y_test, elastic_net_opt
In [47]:
data = sdge1_lin
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
l1_space = np.linspace(0, 1, 30)
alpha_space = np.logspace(-2, 0, 30)
In [48]:
# Fitting, tuning and predicting using the best elastic net regression model
import warnings  
warnings.filterwarnings('ignore')
X_train, X_test, y_train, y_test, elastic_net_opt = trend_model(data=data, cols_to_transform=cols_to_transform, 
                                                                l1_space=l1_space, alpha_space=alpha_space,
                                                                scale = True, test_size = 0.15, include_test_scale=False)
Tuned ElasticNet l1 ratio: {'alpha': 1.0, 'l1_ratio': 0.9655172413793103}
Tuned ElasticNet R squared: 0.5993895343153239
Tuned ElasticNet RMSE: 311.883248051392
<Figure size 360x360 with 0 Axes>
In [49]:
# Plotting the observed test energy and predicted energy data on the same graph as line plots

plot_ts_pred(y_test, elastic_net_opt.predict(X_test), model_name='Optimal Elastic net regression', \
             og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
In [50]:
# printing the error metrics
print('Elastic net regression on training set')
error_metrics(elastic_net_opt.predict(X_train), y_train, model_name = 'Tuned elastic net regression with reduced hour space', 
              test = False)

print('\nElastic net regression on test set')
error_metrics(elastic_net_opt.predict(X_test), y_test, model_name = 'Tuned elastic net regression with reduced hour space', 
              test = True)
Elastic net regression on training set

Error metrics for model Tuned elastic net regression with reduced hour space
RMSE or Root mean squared error: 277.96
Variance score: 0.66
Mean Absolute Error: 212.49
Mean Absolute Percentage Error: 9.00 %

Elastic net regression on test set

Error metrics for model Tuned elastic net regression with reduced hour space
RMSE or Root mean squared error: 311.88
Variance score: 0.60
Mean Absolute Error: 236.89
Mean Absolute Percentage Error: 9.98 %
  • The optimized model here has a l1_ratio which is close to 1 which means it is almost like a Lasso regression model ($L1$ regularization).
  • We see that the regression model's performance decreases by clubbing the hours into larger bins. In fact in terms of MAPE, it performs poorer than the baseline model on the test set.

4. Random Forest Regression


Trying Random Forest regression on the above reduced X space (clubbing hours into larger bins)

  • Usually the Random forest doesn't perform very well on a time series problem and most often than not it will overfit the data and predict poorly on the test set. But the performane of random forest on a time series problem can be improved by some feature engineering.
  • Let's try to fit random forest on the basic reduced space dataset that we fitted our tuned elastic model on.
In [51]:
from sklearn.ensemble import RandomForestRegressor
In [52]:
# Tuning Random forest
# n_estimators = number of trees in the forest
# max_features = max number of features considered for splitting a node

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(10, 200, 10, endpoint=True)]
max_features = ['auto', 'sqrt']
max_depth = list(range(1,6))
# Create the random grid
random_grid = {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth':max_depth}
print(random_grid)
{'n_estimators': [10, 31, 52, 73, 94, 115, 136, 157, 178, 200], 'max_features': ['auto', 'sqrt'], 'max_depth': [1, 2, 3, 4, 5]}
In [53]:
#import randomsearchcv
from sklearn.model_selection import RandomizedSearchCV

# First create the base model to tune
rf = RandomForestRegressor()

# Creating a time series split as discussed in the Introduction
tscv = TimeSeriesSplit(n_splits=5)
# Random search of parameters
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, 
                               cv = tscv, verbose=2, random_state = 42, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train, y_train)

rf_random.best_params_
#rf.fit(X_train, y_train)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   13.8s finished
Out[53]:
{'n_estimators': 10, 'max_features': 'sqrt', 'max_depth': 4}
In [54]:
rf_random.score(X_train, y_train)
Out[54]:
0.7236018480252113
In [55]:
rf_random.score(X_test, y_test) # expect to see overfitting here
Out[55]:
0.6582355552287928
In [56]:
# Plotting the observed test energy and predicted energy data on the same graph as line plots

plot_ts_pred(y_test, rf_random.predict(X_test), model_name='Tuned Random forest with reduced hour space', \
             og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
In [57]:
# Random forest error metrics

print('Tuned Random forest errors on training set')
error_metrics(rf_random.predict(X_train), y_train, model_name = 'Tuned Random forest with reduced hour space', test = False)
print('\nTuned Random forest errors on test set')
error_metrics(rf_random.predict(X_test), y_test, model_name = 'Tuned Random forest with reduced hour space', test = True)
Tuned Random forest errors on training set

Error metrics for model Tuned Random forest with reduced hour space
RMSE or Root mean squared error: 251.79
Variance score: 0.72
Mean Absolute Error: 193.41
Mean Absolute Percentage Error: 8.20 %

Tuned Random forest errors on test set

Error metrics for model Tuned Random forest with reduced hour space
RMSE or Root mean squared error: 288.07
Variance score: 0.66
Mean Absolute Error: 223.08
Mean Absolute Percentage Error: 9.58 %
  • The above RF model performs similar to the elastic net regression.

5. Adding lags as X variables (Elastic Net & RF)


Adding lags to our data

What makes the time series different from other datasets is that the current y value depends on the previous N values in time depending on the correlation of the data with its lagged version. For example, today's outside temperature can depend on yesterday's outside temperature or maybe depend on the last 5 days of temperature values. The energy consumption values can also be expected to depend on it's previous lagged values because energy consumption of a region shouldn't be expected to change much except for any unexpected or unfortunate events. So we will add the lagged values of energy consumtpion as the X parameters and check if we can predict better using the past values.

In [58]:
sdge1_lin.head(2)
Out[58]:
SDGE year HourlyDryBulbTemperature cum_AC_kW time_of_day_evening time_of_day_morning time_of_day_night season_winter non_working_working
Dates
2014-01-01 00:00:00 2096.0 2014 51.0 220992.227 0 0 1 1 0
2014-01-01 01:00:00 1986.0 2014 51.5 220992.227 0 0 1 1 0
In [59]:
# Adding max 24 lags; lag1 is the value of the energy consumption in the previous hour, lag2 is the value of energy consumption..
#..2 hours before the current value and so on.

# Creating the lag variables
for i in range(24):
    sdge1_lin['lag'+str(i+1)] = sdge1_lin['SDGE'].shift(i+1)
In [60]:
# Since the first 24 values won't have any 24th lag, they will be NaN. So dropping the NaNs
lag_sdge = sdge1_lin.dropna()
In [61]:
lag_sdge.head(2)
Out[61]:
SDGE year HourlyDryBulbTemperature cum_AC_kW time_of_day_evening time_of_day_morning time_of_day_night season_winter non_working_working lag1 ... lag15 lag16 lag17 lag18 lag19 lag20 lag21 lag22 lag23 lag24
Dates
2014-01-02 00:00:00 1904.0 2014 53.0 221141.892 0 0 1 1 1 2041.0 ... 2250.0 2141.0 2055.0 2011.0 1922.0 1899.0 1896.0 1936.0 1986.0 2096.0
2014-01-02 01:00:00 1855.0 2014 52.0 221141.892 0 0 1 1 1 1904.0 ... 2304.0 2250.0 2141.0 2055.0 2011.0 1922.0 1899.0 1896.0 1936.0 1986.0

2 rows × 33 columns

Fitting Elastic Net Regression on the Lag data

In [62]:
# Fitting Elastic Net Regression model using the lag variables as the additional inputs ...
#...(extending the reduced space model)

# Not tuning the elastic net this time because we will see that there hardly any scope for improvement in the excellent results
#..we get

cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
# Adding the energy consumption lags to the columns to transform 
list_lags = ['lag'+str(i+1) for i in range(24)]
cols_to_transform.extend(list_lags)

X_train_lag, X_test_lag, y_train_lag, y_test_lag = train_test(lag_sdge, \
                                                              test_size = 0.15, scale = True, \
                                                              cols_to_transform=cols_to_transform)
In [63]:
elastic_net_lag = ElasticNet(l1_ratio = 1, alpha=0.2)
elastic_net_lag.fit(X_train_lag, y_train_lag)
Out[63]:
ElasticNet(alpha=0.2, copy_X=True, fit_intercept=True, l1_ratio=1,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
In [64]:
# Plot the coefficients
_ = plt.figure(figsize = (16, 7))
_ = plt.plot(range(len(X_train_lag.columns)), elastic_net_lag.coef_)
_ = plt.xticks(range(len(X_train_lag.columns)), X_train_lag.columns.values, rotation = 90)
_ = plt.margins(0.02)
_ = plt.axhline(0, linewidth = 0.5, color = 'r')
_ = plt.title('significane of features as per Elastic regularization with scaling and including lags')
_ = plt.ylabel('Linear reg coeff')
_ = plt.xlabel('Features')
In [65]:
X_train_lag.columns.values
Out[65]:
array(['year', 'HourlyDryBulbTemperature', 'cum_AC_kW',
       'time_of_day_evening', 'time_of_day_morning', 'time_of_day_night',
       'season_winter', 'non_working_working', 'lag1', 'lag2', 'lag3',
       'lag4', 'lag5', 'lag6', 'lag7', 'lag8', 'lag9', 'lag10', 'lag11',
       'lag12', 'lag13', 'lag14', 'lag15', 'lag16', 'lag17', 'lag18',
       'lag19', 'lag20', 'lag21', 'lag22', 'lag23', 'lag24'], dtype=object)

Fitting Random Forest on the lag data

In [66]:
#rflag = RandomForestRegressor()
#rflag.fit()

# First create the base model to tune
rf = RandomForestRegressor()
tscv = TimeSeriesSplit(n_splits=5)
# Random search of parameters
rflag = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, 
                               cv = tscv, verbose=2, random_state = 42, n_jobs = -1)

# Fit the random search model
rflag.fit(X_train_lag, y_train_lag)

rflag.best_params_
#rf.fit(X_train, y_train)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   54.5s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   58.9s finished
Out[66]:
{'n_estimators': 73, 'max_features': 'auto', 'max_depth': 5}
In [67]:
# comparison of scores using elastic net and RF on the lag dataset

print('Random forest errors with lags on training')
error_metrics(rflag.predict(X_train_lag), y_train_lag, model_name = 'Random forest with all lags', test=False)
print('\nRandom forest errors with lags on test')
error_metrics(rflag.predict(X_test_lag), y_test_lag, model_name = 'Random forest with all lags', test=True)
print('\nElastic net errors with lags on train')
error_metrics(elastic_net_lag.predict(X_train_lag), y_train_lag, model_name = 'Elastic net with all lags', test=False)
print('\nElastic net errors with lags on test')
error_metrics(elastic_net_lag.predict(X_test_lag), y_test_lag, model_name = 'Elastic net with all lags', test=True)
Random forest errors with lags on training

Error metrics for model Random forest with all lags
RMSE or Root mean squared error: 79.51
Variance score: 0.97
Mean Absolute Error: 60.53
Mean Absolute Percentage Error: 2.56 %

Random forest errors with lags on test

Error metrics for model Random forest with all lags
RMSE or Root mean squared error: 83.82
Variance score: 0.97
Mean Absolute Error: 63.70
Mean Absolute Percentage Error: 2.73 %

Elastic net errors with lags on train

Error metrics for model Elastic net with all lags
RMSE or Root mean squared error: 51.59
Variance score: 0.99
Mean Absolute Error: 36.36
Mean Absolute Percentage Error: 1.51 %

Elastic net errors with lags on test

Error metrics for model Elastic net with all lags
RMSE or Root mean squared error: 50.75
Variance score: 0.99
Mean Absolute Error: 37.36
Mean Absolute Percentage Error: 1.58 %
In [68]:
plot_predvstrue_reg(elastic_net_lag.predict(X_test_lag), y_test_lag, 
                    model_name = 'Elastic Net Regression with all lags test set')
In [69]:
plot_predvstrue_reg(rflag.predict(X_test_lag), y_test_lag, 
                    model_name = 'Random Forest Regression with all lags test set')
In [70]:
# Plotting the observed test values, and predicted test values using the above Elastic net and Random forest models using lags
#..up to last 24 hour values

fig = go.Figure()

fig.add_trace(go.Scatter(x = X_test_lag.index, y = np.array(y_test_lag), name = "observed",
                         line_color = 'deepskyblue', opacity = 0.5))

fig.add_trace(go.Scatter(x = X_test_lag.index, y = rflag.predict(X_test_lag), name = "Random forest predictions",
                         line_color = 'lightslategrey', opacity = 0.5))

fig.add_trace(go.Scatter(x = X_test_lag.index, y = elastic_net_lag.predict(X_test_lag), name = "Elastic net predictions",
                         line_color = 'lightcoral', opacity = 0.5))

fig.update_layout(title_text = 'Observed test set vs Predicted energy using Random forest & elastic net \
reg using lags upto 24h on test set',
                  xaxis_rangeslider_visible = True)
fig.show()
  • Both Elastic net and Random Forest models give excellent performance by using all lags up to past 24 hours. Elastic net seems to perform better than Random Forest here and since the independent variables like cum_AC_kW which is the cumulative solar panel installations in kW till current date and also the Hourly Bulb temperature are linearly correlated with the energy consumption the first preference should be a linear model like the elastic net regression.
  • From the coefficient plot it can be seen how the first and the 23rd lags are the most important predictors.
  • This model performs very well but it comes with a limitation- we can use it to predict only the next hour value. That is, the maximum time window it can predict accurately is 1 hour. So, if that is the application case then the elastic net model with previous lag values should be used.


6. Time Series Features and Models


6.1. Basic Time series structure, Stationarity, Auto Correlation


Here is a good article explaining the nitty gritties of a time series data in detail : link

In [71]:
# Creating a simple time series dataframe
ts_sdge = pd.DataFrame(sdge1_lin["SDGE"], columns=['SDGE'])
In [72]:
ts_sdge.tail()
Out[72]:
SDGE
Dates
2018-12-31 19:00:00 2609.685890
2018-12-31 20:00:00 2504.283858
2018-12-31 21:00:00 2389.460971
2018-12-31 22:00:00 2259.741640
2018-12-31 23:00:00 2123.183153
In [73]:
# Plotting our original time series
plot_timeseries(ts_sdge['SDGE'], title = 'Original data set')

Let's decompose the time series to identify its multiple components

In [74]:
from statsmodels.graphics import tsaplots
In [75]:
import statsmodels.api as sm

# Run seasonal decompose
decomp = sm.tsa.seasonal_decompose(ts_sdge['SDGE'], freq=24*365) # capturing the yearly seasonal component; i.e. for example 
#... every July the energy consumption is high and then it gets lower during the winter months. 
#...So, a periodicity of 24hours*365 was used here.

print(decomp.seasonal.head()) # checking the seasonal component
_ = decomp.plot()
Dates
2014-01-01 00:00:00   -281.350242
2014-01-01 01:00:00   -386.023111
2014-01-01 02:00:00   -448.506518
2014-01-01 03:00:00   -478.883298
2014-01-01 04:00:00   -469.076471
Name: SDGE, dtype: float64
  • The sm.tsa.seasonal_decompose does a pretty good job of decomposing our energy consumption dataset, especially when we tell it that our data has a periodicity of 24*365 hours.
  • The trend shows how the energy consumption decreases over the years from 2014-18 which we had already inferred from the linear regression coefficient values for the year variable from the linear regression models. But, another important thing to keep in mind here is that the trend, though observably clear, is very small.
  • The seasonal component captures the summer and winter trends very well too.
In [76]:
ts_sdge['seasonal'] = decomp.seasonal
In [77]:
# Plotting the seasonal coponent only
plot_timeseries(ts_sdge['seasonal'], title = 'Seasonal component')
In [78]:
ts_sdge['trend'] = decomp.trend
In [79]:
# Plotting the trend (the starting 6 months and the end 6 months of data is NaN because we had set the periodicity as 24*365,
#... so the trend data will have 8760 less values)
ts_sdge['trend'].dropna().plot()
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x24389c7a908>

We can also check the trend using the pandas rolling method

In [80]:
# Plotting the half yearly rolling average of the time series to check trend (mean consumption)

_ = sdge['SDGE'].rolling(window = 24*30*12).mean().plot(figsize=(12,5))
_ = plt.title('Checking trend in the data by averaging yearly values')
In [81]:
# Plotting the quartely rolling MAX of the time series to check trend
_ = sdge['SDGE'].rolling(window = 24*30*12).max().plot(figsize=(12,5))
_ = plt.title('Checking trend in the data by taking the MAX of yearly values')
  • We can observe a downward trend in the data as observed using the decompose method above

Stationarity

Before we move forward we need to discuss an important property of the time series data which is Stationarity.

  • If a process is stationary, that means it does not change its statistical properties over time, namely its mean and variance. (The latter is called homoscedasticity). The covariance function does not depend on time; it should only depend on the distance between observations.
  • Why is stationarity so important? Because it is easy to make predictions on a stationary series since we can assume that the future statistical properties will not be different from those currently observed. Most of the time-series models, in one way or the other, try to predict those properties (mean or variance, for example). Future predictions would be wrong if the original series were not stationary.

From the above definition of stationarity it is clear that our energy consumption dataset is not stationary because it has a trend (changing mean) and also seasonality which means that the covariance function does depend on time. The simplest way to identify stationarity (or its absence) is viewing the data as we did above by decomposing it but the most common way of testing a dataset for stationarity is the Dicky Fuller test which tests for a unit root. Basically, if the p-value of the test is too small (say less than 0.05, giving us 95% confidence) then we reject the hypothesis that our data is non-stationary and assume that it is stationary indeed.


And the most common way of dealing with stationarity is differencing our dataset. Let's see some examples below.

In [82]:
# Differencing the data once

decomp_diff1 = sm.tsa.seasonal_decompose(ts_sdge['SDGE'].diff().dropna(), freq=24*365) 
_ = decomp_diff1.plot()
  • Comparing the above differenced plot with the previous original data plot we can see how the trend was removed by differencing just once and also the effect of seasonality was also reduced.

  • To deal with seasonality we can difference the dataset with the periodicity of the seasons. Let's difference the dataset by its smallest period which is 24 hours here.
In [83]:
# Differencing the data with periodicity of 24 hours in addition to the original single differencing

decomp_diff24 = sm.tsa.seasonal_decompose(ts_sdge['SDGE'].diff().dropna().diff(24).dropna(), freq=24*365) 
_ = decomp_diff24.plot()
  • Differencing the dataset once and then with the period of 24 hours further diminishes any trend and seasonality effects and seems to render the dataset stationary.

Conducting the Dicky Fuller test of Stationarity

In [84]:
from statsmodels.tsa.stattools import adfuller

Testing for stationarity

In [85]:
def run_adfuller(ts):
    result = adfuller(ts)
    # Print test statistic
    print("t-stat", result[0])
    # Print p-value
    print("p-value", result[1])
    # Print #lags used
    print("#lags used", result[2])
    # Print critical values
    print("critical values", result[4]) 
In [86]:
print("for no differencing\n")
run_adfuller(ts_sdge['SDGE'])
print("\nfor single differencing\n")
run_adfuller(ts_sdge['SDGE'].diff().dropna())
print("\nfor differenced data set over lags 24 after single differencing\n")
run_adfuller(ts_sdge['SDGE'].diff().dropna().diff(24).dropna())
for no differencing

t-stat -10.753363911381065
p-value 2.6306348577679226e-19
#lags used 55
critical values {'1%': -3.4304994170070295, '5%': -2.861606039041269, '10%': -2.5668051504336344}

for single differencing

t-stat -31.881870452429034
p-value 0.0
#lags used 55
critical values {'1%': -3.430499420421149, '5%': -2.8616060405501966, '10%': -2.566805151236794}

for differenced data set over lags 24 after single differencing

t-stat -44.35188189610505
p-value 0.0
#lags used 55
critical values {'1%': -3.4304995024068514, '5%': -2.861606076785167, '10%': -2.5668051705236494}
  • Based on the p-values of the above tests it seems that even our original dataset is stationary but if we check the t-stat values it is obvious that the stationarity gets more significant with differencing. The original dataset was declared stationary by the test maybe because the trend in our data is very weak (as evident from the decomposed plot). So, we can either use the original dataset as it is with the time series models (as we have done for the linear regression models before), or to be more robust we can use the single differencing to remove the trend and fit our models on the detrended data.

Exponential smoothing

Simple techniques like rolling average of the past values is often useful to visualize the trend in a dataset. The simplest model for a time series will be a persistent model where simply $y_t = y_{t-1}$. Most of the times the time series models are compared to this naive model to test their performance. A more advance smoothing technique is called exponential smoothing and it is given as:

$$\hat{y}_{t} = \alpha \cdot y_t + (1-\alpha) \cdot \hat y_{t-1} $$

Here the model value is a weighted average between the current true value and the previous model values. The $\alpha$ weight is called a smoothing factor. It defines how quickly we will "forget" the last available true observation. The smaller $\alpha$ is, the more influence the previous observations have and the smoother the series is.

In [87]:
# Defining a function to plot exponentially smoothed data
def plot_ewma(ts, alpha):
    expw_ma = ts.ewm(alpha=alpha).mean()
    
    plot_ts_pred(ts, expw_ma, model_name='Exponentially smoothed data with alpha = {}'.format(alpha), \
                 og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
In [88]:
# Plotting the exponentially smoothed energy consumption data with alpha=0.3
plot_ewma(ts_sdge['SDGE'], 0.3)

The EWMA does a good job of modeling the future values but it requires N number of lag variables depending on the value of alpha and also it's a very simplistic model for our energy consumption data.

There are other advanced models like the Triple exponential smoothing a.k.a Holt Winters model but I am skipping that model here because we can try better models like FB Prophet and ARIMA on our data (which seems to have a trend and multiple seasonality).

In [ ]:
 

Correlation with the past values

The correlation of the time series observations calculated with values of the same series at previous times, is called a serial correlation, or an autocorrelation (ACF). It is used to determine the moving average (MA or q) term of the ARIMA(p,d,q) models.

A partial autocorrelation (PACF) is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed. It is used to determine the auto regression (AR or p) term of the ARIMA(p,d,q) models.

And we have observed till now that our energy consumption time series has a strong correlation with its past day value (lag of 24 hours) and also its past value 365*24 hours ago. In addition a weekly seasonality is also observed in the energy consumption. Let's plot the ACF and PACF plots for our energy time series data.

Plotting the ACF and PACF plots

In [89]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
In [90]:
def acf_pacf_plots(ts, lags, figsize = (12,8)):
    
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize = figsize)
    
    # Plot the ACF of ts
    _ = plot_acf(ts, lags = lags, zero = False, ax = ax1, alpha = 0.05)

    # Plot the PACF of ts
    _ = plot_pacf(ts, lags = lags, zero = False, ax = ax2, alpha = 0.05)
In [91]:
# Create the acf and pacf plots

dfacf = []
dfacf = ts_sdge['SDGE']
lags = 50

acf_pacf_plots(dfacf, lags = lags, figsize = (12,8))

The seasonal period of 24 hours can be easily observed from the plots above.

In [92]:
# Differencing the data this time to remove the trend and seasonality
dfacf = []
dfacf = ts_sdge['SDGE']
dfacf = dfacf.diff().dropna() 
dfacf = dfacf.diff(24).dropna()
dfacf = dfacf.diff(24*365).dropna()
lags = 100

acf_pacf_plots(dfacf, lags = lags, figsize = (12,8))

Even after differencing and seasonal differencing the data, the ACF and PACF plots are not very informative, they do not tell us which AR and MA terms will be good for the SARIMAX model (S stands for seasonality and X for exogenous variables).

This is a common pitfall of the ACF, PACF plots and SARIMAX models. They cannot handle multiple seasonality. And our energy consumption dataset has multiple seasonalities - daily (24 hours), weekly (168hours), yearly (24*365.25). We'll have to deal with the multiple seasonality separately to be able to use SARIMAX model on our energy consumption time series


6.2. Handling Multiple seasonality


There are two interesting time series forecasting methods called BATS and TBATS that are capable of modeling time series with multiple seasonalities link. Unfortunately BATS and TBATS capabilities do not come for free. The method is very generic. Under the hood it builds and evaluates many model candidates. This results in slowness of the computation. And SARIMAX models with Fourier series handling the multiple seasonalities can perform as good as the TBATS model, so we will opt for the simpler model here i.e. SARIMAX. We will need to create some extra features here to model the multiple seasonalities.

Adding fourier cyclical series for hour, year and week periods

In [93]:
import warnings
warnings.filterwarnings('ignore')

# We can add multiple fourer series with different k terms in the (2*k*pi) such as k=1,2,3...etc. To generalize the problem, 
# we could have chosen an optimal k value for each season by trying out some k values and choosing the values giving 
#the lowest AIC.

def add_fourier_terms(df, year_k, week_k, day_k):
    """
    df: dataframe to add the fourier terms to 
    year_k: the number of Fourier terms the year period should have. Thus the model will be fit on 2*year_k terms (1 term for 
    sine and 1 for cosine)
    week_k: same as year_k but for weekly periods
    day_k:same as year_k but for daily periods
    """
    
    for k in range(1, year_k+1):
        # year has a period of 365.25 including the leap year
        df['year_sin'+str(k)] = np.sin(2 *k* np.pi * df.index.dayofyear/365.25) 
        df['year_cos'+str(k)] = np.cos(2 *k* np.pi * df.index.dayofyear/365.25)

    for k in range(1, week_k+1):
        
        # week has a period of 7
        df['week_sin'+str(k)] = np.sin(2 *k* np.pi * df.index.dayofweek/7)
        df['week_cos'+str(k)] = np.cos(2 *k* np.pi * df.index.dayofweek/7)


    for k in range(1, day_k+1):
        
        # day has period of 24
        df['hour_sin'+str(k)] = np.sin(2 *k* np.pi * df.index.hour/24)
        df['hour_cos'+str(k)] = np.cos(2 *k* np.pi * df.index.hour/24) 
In [94]:
warnings.filterwarnings('ignore')

# as said above the k terms for each yearly, weekly and daily seasonalities could be chosen by optimizing on the AIC values..
# but after some research on energy consumption time series k= 5 was chosen for each seasonality
add_fourier_terms(lag_sdge, year_k= 5, week_k=5 , day_k=5)
In [95]:
#ph = -0.7
#y1 = np.sin(2 * np.pi * lag_sdge.index.hour/24+ph) + np.cos(2 * np.pi * lag_sdge.index.hour/24+ph)
#y2 = np.sin(2 * 2* np.pi * lag_sdge.index.hour/24) + 0.7*np.cos(2 *2* np.pi * lag_sdge.index.hour/24)
#y3 = np.sin(2 * 2* np.pi * lag_sdge.index.hour/24) + np.cos(2 *2* np.pi * lag_sdge.index.hour/24)

#plt.plot(lag_sdge.index, 1-(y1+y2))
#plt.xlim(['2014-01-02', '2014-01-03'])
#plt.xticks(rotation=90)
In [96]:
# Visualizing the new variables on hour seasonality
pd.plotting.register_matplotlib_converters()
_ = (1-lag_sdge.loc['01-02-2014':'01-02-2014', [col for col in lag_sdge if col.startswith('hour')]]).sum(axis = 1).plot()
In [97]:
# Visualizing the new variables on year seasonality
_ = (1-lag_sdge.loc['01-01-2014':'01-01-2015', [col for col in lag_sdge if col.startswith('year')]]).sum(axis = 1).plot()
In [98]:
# Visualizing the new variables on week seasonality
_ = (1-lag_sdge.loc['01-01-2014':'01-09-2014', [col for col in lag_sdge if col.startswith('week')]]).sum(axis = 1).plot()
In [99]:
lag_sdge.columns
Out[99]:
Index(['SDGE', 'year', 'HourlyDryBulbTemperature', 'cum_AC_kW',
       'time_of_day_evening', 'time_of_day_morning', 'time_of_day_night',
       'season_winter', 'non_working_working', 'lag1', 'lag2', 'lag3', 'lag4',
       'lag5', 'lag6', 'lag7', 'lag8', 'lag9', 'lag10', 'lag11', 'lag12',
       'lag13', 'lag14', 'lag15', 'lag16', 'lag17', 'lag18', 'lag19', 'lag20',
       'lag21', 'lag22', 'lag23', 'lag24', 'year_sin1', 'year_cos1',
       'year_sin2', 'year_cos2', 'year_sin3', 'year_cos3', 'year_sin4',
       'year_cos4', 'year_sin5', 'year_cos5', 'week_sin1', 'week_cos1',
       'week_sin2', 'week_cos2', 'week_sin3', 'week_cos3', 'week_sin4',
       'week_cos4', 'week_sin5', 'week_cos5', 'hour_sin1', 'hour_cos1',
       'hour_sin2', 'hour_cos2', 'hour_sin3', 'hour_cos3', 'hour_sin4',
       'hour_cos4', 'hour_sin5', 'hour_cos5'],
      dtype='object')
In [100]:
# Choosing a subset of the above dataframe; removing the lags and the hour bins
sdgecyc = lag_sdge.drop([col for col in lag_sdge if 
                         col.startswith('time') or col.startswith('season') or col.startswith('lag')], axis=1)

6.3. SARIMAX


Trying the SARIMAX model on the above fourier series added dataset

In [101]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
In [102]:
import pmdarima as pm
In [103]:
# defining the exogenous variables for the SARIMAX model
#sdgecyc.drop('SDGE', axis=1)
  • The SARIMAX model takes many inputs which require some tuning to be able to pick the best model parameters. $pmdarima$ package's auto_arima module automates this task and gives us the best model by taking in some input ranges, similar to gridsearchcv.
In [104]:
sdgecyc.columns
Out[104]:
Index(['SDGE', 'year', 'HourlyDryBulbTemperature', 'cum_AC_kW',
       'non_working_working', 'year_sin1', 'year_cos1', 'year_sin2',
       'year_cos2', 'year_sin3', 'year_cos3', 'year_sin4', 'year_cos4',
       'year_sin5', 'year_cos5', 'week_sin1', 'week_cos1', 'week_sin2',
       'week_cos2', 'week_sin3', 'week_cos3', 'week_sin4', 'week_cos4',
       'week_sin5', 'week_cos5', 'hour_sin1', 'hour_cos1', 'hour_sin2',
       'hour_cos2', 'hour_sin3', 'hour_cos3', 'hour_sin4', 'hour_cos4',
       'hour_sin5', 'hour_cos5'],
      dtype='object')
In [105]:
# Creating the training and test datasets
# check https://medium.com/@josemarcialportilla/using-python-and-auto-arima-to-forecast-seasonal-time-series-90877adff03c

# not adding year variable here because the model will use the most recent lag energy consumption values
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW']  
X_train_lag, X_test_lag, y_train_lag, y_test_lag = train_test(sdgecyc, 
                                                              test_size = 0.15, scale = True, 
                                                              cols_to_transform=cols_to_transform)

# Since ARIMA model uses the past lag y values, scaling the energy values as well. 
#i.e. fit the scaler on y_train and transform it and also transform y_test using the same scaler

scaler1 = StandardScaler()
y_train_lag = pd.DataFrame(scaler1.fit_transform(y_train_lag.values.reshape(-1,1)), index = y_train_lag.index, 
                           columns = ['SDGE'])
# y_test_lag = scaler1.transform(y_test_lag)
# The code below can be used to train and find the optimal SARIMAX model parameters p,q,P,Q. #This was run and the optimal model was found to be SARIMAX(2,1,1)x(1,0,1,24). The auto regressive term p=2 means two values #from the past (1 and 2 hours behind) will be used and moving average term of q=1 means 1 past term will be used as the moving #average term . d=1 term means the energy series will differenced once. Seasonal period m of 24 hours here and P=Q=1 means the auto regressive term (P) and the moving average (Q) of #exactly 1*24 hours behind will be used. # results = pm.auto_arima(y_train_lag, #data d=1, # non-seasonal difference order start_p=0, # initial guess for p start_q=0, # initial guess for q max_p=2, # max value of p to test max_q=2, exogenous= X_train_lag, #including the exogenous variables seasonal=True, # is the time series seasonal m=24, # the seasonal period #D=1, # seasonal difference order start_P=1, # initial guess for P start_Q=1, # initial guess for Q max_P=1, # max value of P to test max_Q=1, # max value of Q to test information_criterion='aic', # used to select best model trace=True, # print results whilst training error_action='ignore', # ignore orders that don't work stepwise=True, # apply intelligent order search ) #results.fit(y_train_lag, exogenous= X_train_lag)
In [106]:
# Got the optimal SARIMAX order from running the above auto_arima function. It took more than 2 hours to get the optimal...
#...solution. So, just using the final values here
model_opt = SARIMAX(y_train_lag, order=(2,1,1), seasonal_order=(1, 0, 1, 24), exog = X_train_lag, trend='c')
In [107]:
results = model_opt.fit()
In [108]:
results.summary()
Out[108]:
Statespace Model Results
Dep. Variable: SDGE No. Observations: 37230
Model: SARIMAX(2, 1, 1)x(1, 0, 1, 24) Log Likelihood 42363.165
Date: Mon, 09 Sep 2019 AIC -84644.330
Time: 13:37:52 BIC -84294.811
Sample: 01-02-2014 HQIC -84533.323
- 04-02-2018
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -9.145e-05 0.000 -0.453 0.651 -0.000 0.000
year -0.0182 0.062 -0.292 0.770 -0.140 0.104
HourlyDryBulbTemperature -0.0141 0.002 -7.423 0.000 -0.018 -0.010
cum_AC_kW 6.5309 1.642 3.979 0.000 3.314 9.748
non_working_working -0.0080 0.015 -0.540 0.589 -0.037 0.021
year_sin1 3.5414 1.761 2.012 0.044 0.091 6.992
year_cos1 -1.9545 1.820 -1.074 0.283 -5.522 1.613
year_sin2 -0.1210 0.726 -0.167 0.868 -1.544 1.302
year_cos2 0.6177 0.783 0.789 0.430 -0.917 2.153
year_sin3 -0.2134 0.404 -0.528 0.597 -1.005 0.578
year_cos3 -0.2465 0.422 -0.585 0.559 -1.073 0.580
year_sin4 -0.1796 0.252 -0.714 0.476 -0.673 0.314
year_cos4 -0.0192 0.258 -0.074 0.941 -0.525 0.486
year_sin5 -0.0430 0.168 -0.256 0.798 -0.373 0.287
year_cos5 -0.0882 0.181 -0.487 0.626 -0.443 0.267
week_sin1 -0.0001 0.009 -0.014 0.989 -0.018 0.018
week_cos1 0.0025 0.006 0.421 0.674 -0.009 0.014
week_sin2 -0.0046 0.002 -2.652 0.008 -0.008 -0.001
week_cos2 0.0008 0.002 0.342 0.732 -0.004 0.006
week_sin3 -0.0106 0.001 -8.769 0.000 -0.013 -0.008
week_cos3 0.0003 0.001 0.249 0.803 -0.002 0.002
week_sin4 -0.0112 0.001 -9.253 0.000 -0.014 -0.009
week_cos4 0.0003 0.001 0.249 0.803 -0.002 0.002
week_sin5 -0.0076 0.002 -4.326 0.000 -0.011 -0.004
week_cos5 0.0008 0.002 0.342 0.732 -0.004 0.006
hour_sin1 -0.7241 0.043 -16.876 0.000 -0.808 -0.640
hour_cos1 -0.3384 0.033 -10.298 0.000 -0.403 -0.274
hour_sin2 -0.2909 0.017 -17.112 0.000 -0.324 -0.258
hour_cos2 -0.3022 0.017 -17.709 0.000 -0.336 -0.269
hour_sin3 0.0199 0.010 1.931 0.053 -0.000 0.040
hour_cos3 -0.0325 0.009 -3.516 0.000 -0.051 -0.014
hour_sin4 0.0461 0.006 7.427 0.000 0.034 0.058
hour_cos4 0.0078 0.006 1.301 0.193 -0.004 0.020
hour_sin5 0.0261 0.004 6.316 0.000 0.018 0.034
hour_cos5 -0.0197 0.004 -4.765 0.000 -0.028 -0.012
ar.L1 0.2345 0.028 8.321 0.000 0.179 0.290
ar.L2 0.0971 0.017 5.628 0.000 0.063 0.131
ma.L1 0.3890 0.028 13.794 0.000 0.334 0.444
ar.S.L24 0.9589 0.001 799.344 0.000 0.957 0.961
ma.S.L24 -0.6938 0.003 -240.635 0.000 -0.699 -0.688
sigma2 0.0060 2.38e-05 253.787 0.000 0.006 0.006
Ljung-Box (Q): 3964.50 Jarque-Bera (JB): 59529.39
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 1.18 Skew: 0.00
Prob(H) (two-sided): 0.00 Kurtosis: 9.19


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 4.81e+21. Standard errors may be unstable.
In [109]:
#X_train_lag.columns
In [110]:
# plotting the residuals and checking if they meet the i.i.d requirements
_ = results.plot_diagnostics(figsize=(12, 7))
In [111]:
# Predictions on test set. Predicting only the last week of the training set

# Setting dynamic = True so that the model won't use actual enegy values for prediction. Basically the model will use
# the lag terms and moving average terms of the already forecasted energy values. So, we will see the errors 
#(confidence interval) increasing with each forecast.
pred = results.get_prediction(start=X_train_lag.index[-24*7], end=X_train_lag.index[-1], 
                              dynamic=True, exog=X_train_lag.iloc[-24*7:, :])
pred_ci = pred.conf_int()

pred1 = scaler1.inverse_transform(pred.predicted_mean)
pred_ci1 = scaler1.inverse_transform(pred.conf_int())

y_actual_train = np.squeeze(scaler1.inverse_transform(y_train_lag))
y_actual_train = pd.Series(y_actual_train, index = X_train_lag.index )

pred1 = pd.Series(pred1, index = X_train_lag.iloc[-24*7:, :].index )
pred_ci1 = pd.DataFrame(pred_ci1, index = pred1.index, columns = ['lower SDGE', 'upper SDGE'])

lower_limits = pred_ci1.loc[:,'lower SDGE']
upper_limits = pred_ci1.loc[:,'upper SDGE']
In [112]:
# Error on training set for 1 week ahead forecast
error_metrics(pred1, y_actual_train.iloc[-24*7:], 'SARIMAX(2,1,1)x(1,0,1,24) with Fourier terms 1 week', 
              test=False) 
Error metrics for model SARIMAX(2,1,1)x(1,0,1,24) with Fourier terms 1 week
RMSE or Root mean squared error: 363.92
Variance score: -0.92
Mean Absolute Error: 317.13
Mean Absolute Percentage Error: 16.74 %
In [113]:
#plot_ts_pred(y_actual_train.iloc[-24*7:], pred1 , model_name='SARIMA (0,1,2)x(1,0,1,24)', 
 #            og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
In [114]:
# Predictions on test set. Predicting only the 1st week of the test set
pred = results.get_forecast(steps = 24*7, exog = X_test_lag.iloc[:24*7, :])
pred_ci = pred.conf_int()

pred2 = scaler1.inverse_transform(pred.predicted_mean)
pred_ci2 = scaler1.inverse_transform(pred.conf_int())

y_actual = y_test_lag.iloc[:24*7]

pred2 = pd.Series(pred2, index = X_test_lag.iloc[:24*7, :].index )
pred_ci2 = pd.DataFrame(pred_ci2, index = pred2.index, columns = ['lower SDGE', 'upper SDGE'])

lower_limits = pred_ci2.loc[:,'lower SDGE']
upper_limits = pred_ci2.loc[:,'upper SDGE']

# plot the predictions
plt.figure(figsize = (15,7))
_ = plt.plot(y_actual.index, y_actual, label='observed')

# plot your mean predictions
_ = plt.plot(pred2.index, pred1.values, color='r', label='forecast')

# shade the area between your confidence limits
_ = plt.fill_between(lower_limits.index, lower_limits, upper_limits, color='pink')

# set labels, legends and show plot
_ = plt.xlabel('Date')
_ = plt.ylabel('Energy consumption MWH')
_ = plt.legend()
_ = plt.title('Testing the SARIMAX (2,1,1)x(1,0,1,24) model on the testing set: predicting the 1st week')
In [115]:
# Checking if the SARIMA model captures the long term seasonality 
#plot_ts_pred(y_actual, pred1 , model_name='SARIMA (0,1,2)x(1,0,1,24)', 
#             og_ts_opacity = 0.5, pred_ts_opacity = 0.5)

# It fails terribly to capture the long term seasonality
In [116]:
error_metrics(pred2, y_actual, 'SARIMAX(2,1,1)x(1,0,1,24) with Fourier terms 1 week', test=True) 
Error metrics for model SARIMAX(2,1,1)x(1,0,1,24) with Fourier terms 1 week
RMSE or Root mean squared error: 150.46
Variance score: 0.68
Mean Absolute Error: 103.46
Mean Absolute Percentage Error: 5.33 %

  • We see that the first week forecast is pretty well but even at the end of first week the forecasting performance decreases and the confidence interval values grow larger beyond the scale of the range of the energy consumption values. Thus, SARIMAX model was not able to capture long term trends but it did well on 1 week ahead forecast.
  • SARIMA models don't capture multiple seasonalities well and are also very time consuming. So, it won't be the first choice if we need both a quick and accurate forecast.
  • Errors for 1 hour ahead forecasts weren't calculated above for SARIMAX model (by usign dynamic=True) because we get excellent results using elastic net regression for 1 hour ahead forecasts using the lag variables and it is much faster to fit than SARIMAX. So, if 1 hour ahead forecasting is the goal then elastic net regression should be used.
In [117]:
sdgecyc.head(2)
Out[117]:
SDGE year HourlyDryBulbTemperature cum_AC_kW non_working_working year_sin1 year_cos1 year_sin2 year_cos2 year_sin3 ... hour_sin1 hour_cos1 hour_sin2 hour_cos2 hour_sin3 hour_cos3 hour_sin4 hour_cos4 hour_sin5 hour_cos5
Dates
2014-01-02 00:00:00 1904.0 2014 53.0 221141.892 1 0.034398 0.999408 0.068755 0.997634 0.103031 ... 0.000000 1.000000 0.0 1.000000 0.000000 1.000000 0.000000 1.0 0.000000 1.000000
2014-01-02 01:00:00 1855.0 2014 52.0 221141.892 1 0.034398 0.999408 0.068755 0.997634 0.103031 ... 0.258819 0.965926 0.5 0.866025 0.707107 0.707107 0.866025 0.5 0.965926 0.258819

2 rows × 35 columns


6.4. FB Prophet


Traditional approaches like SARIMA models often require manual data pre-processing steps (e.g. differencing to make the data stationary) and it’s also hard to explain why these models produce the prediction results to people without forecasting expertise. In addition, these models are not allowed to add additional domain knowledge to improve precision. For solving these problems, Facebook researchers recently released FBProphet, a time series forecasting tool supporting both Python and R.

FBProphet provides a decomposition regression model that is extendable and configurable with interpretable parameters. Prophet frames the forecasting problem as a curve-fitting exercise rather than looking explicitly at the time based dependence of each observation within a time series. Similar to SARIMAX, we can add extra regressor terms like temperature data to the model as well.

At it’s core, Prophet is an additive model with the following components: $$y(t) = g(t) + s(t) + h(t) + ϵₜ$$

$g(t)$ models trend, which describes long-term increase or decrease in the data. Prophet incorporates two trend models, a saturating growth model and a piecewise linear model, depending on the type of forecasting problem.
$s(t)$ models seasonality with Fourier series, which describes how data is affected by seasonal factors such as the time of the year
$h(t)$ models the effects of holidays or large events that impact business time series (e.g. new product launch, Black Friday, Superbowl, etc.)
$ϵₜ$ represents an irreducible error term

Ref1
Ref2
Ref3


  • Using only the 'SDGE', 'HourlyDryBulbTemperature','cum_AC_kW', 'non_working_working' columns while using Prophet because Prophet, unlike SARIMAX, handles multiple seasonalities well. So, we don't need to pass in the Fourier terms separately.
  • FB Prophet can be passed with a holiday feature, but since we have already captured the holidays and weekends in the 'non_working_working' column we won't pass a separate holiday list to Prophet.
In [118]:
# cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW']
X_trainP, X_testP, y_trainP, y_testP = train_test\
                           (lag_sdge[['SDGE', 'HourlyDryBulbTemperature','cum_AC_kW', 'non_working_working']], 
                           test_size=0.15, 
                           scale=False, #True
                           #cols_to_transform=cols_to_transform,
                           include_test_scale=False)

Prophet requires the input data in a particular format, so preparing the data for Prophet.

In [119]:
# preparing data for Prophet
def data_prophet(X_train, X_test, y_train, y_test):
    data_train = pd.merge(X_train, y_train, left_index=True, right_index=True)
    data_train = data_train.reset_index().rename(columns = {'SDGE':'y', 'Dates':'ds'})
    data_test = pd.merge(X_test, y_test, left_index=True, right_index=True)
    data_test  = data_test.reset_index().rename(columns = {'SDGE':'y', 'Dates':'ds'})
    return data_train, data_test
In [120]:
data_train, data_test = data_prophet(X_trainP, X_testP, y_trainP, y_testP)
In [121]:
data_train.tail(3)
Out[121]:
ds HourlyDryBulbTemperature cum_AC_kW non_working_working y
37227 2018-04-02 03:00:00 60.0 885452.821 1 1684.034193
37228 2018-04-02 04:00:00 59.5 885452.821 1 1745.907962
37229 2018-04-02 05:00:00 60.0 885452.821 1 1847.432717
In [122]:
# Importing Prophet
from fbprophet import Prophet
In [123]:
# Initiating fbprophet model; set the uncertainty interval to 95% (the Prophet default is 80%)
prop = Prophet(growth='linear', interval_width = 0.95, 
                yearly_seasonality='auto',
                weekly_seasonality='auto',
                daily_seasonality='auto',
                seasonality_mode='additive',
                #seasonality_prior_scale = 15
              )
  • We can add the seasonalities separately in the Prophet model by replacing the 'auto' mode of the seasonalities above with 'FALSE'. Then we can add the yearly, weekly, daily, monthly, quarterly, etc. seasonalities using the .add_seasoanlity feature of the Prophet specifying the period in days for a seasonality along with the fourier terms to be used and the prior_scale to set.
  • After some trial and error it was concluded that for this problem adding seasonalities manually doesn't give better results than the 'auto' mode, so keeping the seasonality as 'auto'.
  • There are two mode options for any seasonality - 'additive' or 'multiplicaitve'. Multiplicaitve should be used if the seasonality affects the trend exponentially.
  • External regressor like the temperature column are added to the model by using the 'add_regressor' function of the Prophet. There is an option for Standardization while passing the regressors, so Scaler wasn't used abobe while creating the train and test splits.
In [124]:
# Adding seasonality
#prop.add_seasonality(name='daily', period = 1, fourier_order = 15, prior_scale=20)
#prop.add_seasonality(name='weekly', period = 7, fourier_order = 10, prior_scale=20)
#prop.add_seasonality(name='yearly', period = 365.25, fourier_order = 20, prior_scale=20)

# Adding regressors
prop.add_regressor('HourlyDryBulbTemperature', prior_scale=20, mode='additive', standardize=True)
prop.add_regressor('cum_AC_kW', prior_scale = 1, mode='additive', standardize=True)
prop.add_regressor('non_working_working', prior_scale=10, mode='additive', standardize='auto') #'auto'=> standardize if not
#.binary
Out[124]:
<fbprophet.forecaster.Prophet at 0x2438f477ac8>
In [125]:
# Fitting the model to training data
prop.fit(data_train)
Out[125]:
<fbprophet.forecaster.Prophet at 0x2438f477ac8>
In [126]:
# Creating the dataframe with datetime values to predict on (making predictions on train as well as the test set)
future_dates = prop.make_future_dataframe(periods=len(data_test), freq='H', include_history=True)
# Aadding regressors 
future_dates = pd.merge(future_dates, (data_train.append(data_test)).drop('y', axis=1), on = 'ds')
In [127]:
future_dates.tail(3)
Out[127]:
ds HourlyDryBulbTemperature cum_AC_kW non_working_working
43797 2018-12-31 21:00:00 51.0 1034385.178 1
43798 2018-12-31 22:00:00 51.0 1034385.178 1
43799 2018-12-31 23:00:00 50.0 1034385.178 1
In [128]:
# Predicting the future
forecast = prop.predict(future_dates)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
Out[128]:
ds yhat yhat_lower yhat_upper
43795 2018-12-31 19:00:00 2554.571063 432.168795 4426.584096
43796 2018-12-31 20:00:00 2480.760535 296.162577 4369.376555
43797 2018-12-31 21:00:00 2328.716547 184.483223 4189.343851
43798 2018-12-31 22:00:00 2113.042798 -9.938531 4043.340264
43799 2018-12-31 23:00:00 1890.247765 -211.918998 3828.652158
In [129]:
#pd.plotting.register_matplotlib_converters()
_ = prop.plot(forecast, uncertainty = True, xlabel = 'Dates', ylabel = 'Energy consumption MWH')
In [130]:
# Plotting the components of the results
_ = prop.plot_components(forecast)
  • We can see that the multiple seasonalities were captured very well by the Prophet model.
  • In the bottom most plot, the extra regressor additve terms include the temperature, non_working_working and cum_AC_kW variables. We see the impact of temperature and working days in the form of wiggles whereas the overall impact of the cum_AC_kW is a downward trend in the energy (as was expected since more PV gets installed at customer sites, the lower the demand on the grid).
In [131]:
#pd.plotting.register_matplotlib_converters()

plot_ts_pred(data_train.append(data_test).set_index('ds')['y'], forecast.set_index('ds')['yhat'], \
             model_name='FB Prophet with auto seasonality', og_ts_opacity = 0.5, pred_ts_opacity = 0.5)

FB Prophet also provides an option for performing cross validation on the training set. It needs an inital, period and horizon as inputs. $Initial$ is the training sub set to be used for training the model on, $period$ is the number of time frames the data should be shifted forward by for extending the training subset and $horizon$ is the window of forecast. This is similar to the time series split described in the Introduction.

In [132]:
from fbprophet.diagnostics import cross_validation
In [133]:
len(y_trainP)*2//3
Out[133]:
24820

Randomly checking the performance of FB Prophet on 1 week ahead forecast within the training set.

In [134]:
# Cross validating on the training set with an initial training period of 18615 hours (half of the training set) and a forecast.
# ..horizon of 168 hours (1 week ahead)
df_cv = cross_validation(prop, initial='24820 hours', period='1200 hours', horizon = '168 hours')
df_cv.head(3)
INFO:fbprophet:Making 11 forecasts with cutoffs between 2016-11-11 05:00:00 and 2018-03-26 05:00:00
Out[134]:
ds yhat yhat_lower yhat_upper y cutoff
0 2016-11-11 06:00:00 1936.684780 1507.915814 2359.436732 2063.0 2016-11-11 05:00:00
1 2016-11-11 07:00:00 2086.141948 1680.120805 2473.015597 2159.0 2016-11-11 05:00:00
2 2016-11-11 08:00:00 2239.577238 1811.580641 2656.494554 2265.0 2016-11-11 05:00:00
In [135]:
error_metrics(df_cv['yhat'], df_cv['y'], 'FB Prophet with auto seasonality 1 week ahead', test=False)
Error metrics for model FB Prophet with auto seasonality 1 week ahead
RMSE or Root mean squared error: 203.21
Variance score: 0.76
Mean Absolute Error: 164.76
Mean Absolute Percentage Error: 7.66 %
In [136]:
error_metrics(forecast.iloc[-len(data_test['y']):, ]['yhat'], data_test['y'], 
              'FB Prophet with auto seasonality', 
              test=True)
Error metrics for model FB Prophet with auto seasonality
RMSE or Root mean squared error: 262.74
Variance score: 0.72
Mean Absolute Error: 201.06
Mean Absolute Percentage Error: 8.55 %

7. Regression models using Fourier terms


7.1. Elastic net

Let's fit the elastic net and RF models to the above dataset with fourier series

Elastic net regression on data with fourier terms

In [137]:
sdgecyc.head(2)
Out[137]:
SDGE year HourlyDryBulbTemperature cum_AC_kW non_working_working year_sin1 year_cos1 year_sin2 year_cos2 year_sin3 ... hour_sin1 hour_cos1 hour_sin2 hour_cos2 hour_sin3 hour_cos3 hour_sin4 hour_cos4 hour_sin5 hour_cos5
Dates
2014-01-02 00:00:00 1904.0 2014 53.0 221141.892 1 0.034398 0.999408 0.068755 0.997634 0.103031 ... 0.000000 1.000000 0.0 1.000000 0.000000 1.000000 0.000000 1.0 0.000000 1.000000
2014-01-02 01:00:00 1855.0 2014 52.0 221141.892 1 0.034398 0.999408 0.068755 0.997634 0.103031 ... 0.258819 0.965926 0.5 0.866025 0.707107 0.707107 0.866025 0.5 0.965926 0.258819

2 rows × 35 columns

In [138]:
# Fitting Elastic Net Regression model on the data with the fourier series
#sdgecyc['year'] = lag_sdge['year']

data = sdgecyc
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
l1_space = np.linspace(0, 1, 30)
alpha_space = np.logspace(-2, 0, 30)

X_trainF, X_testF, y_trainF, y_testF, elastic_net_opt_F = trend_model(data=data, cols_to_transform=cols_to_transform, 
                                                                l1_space=l1_space, alpha_space=alpha_space,
                                                                scale = True, test_size = 0.15, include_test_scale=False)
Tuned ElasticNet l1 ratio: {'alpha': 0.32903445623126676, 'l1_ratio': 0.6896551724137931}
Tuned ElasticNet R squared: 0.7280319122008612
Tuned ElasticNet RMSE: 256.93217560755096
<Figure size 360x360 with 0 Axes>

7.2. RF

Random Forest regression on data with fourier terms

In [139]:
random_grid['max_depth'] = [3,4,5,6,7,8]
In [140]:
random_grid
Out[140]:
{'n_estimators': [10, 31, 52, 73, 94, 115, 136, 157, 178, 200],
 'max_features': ['auto', 'sqrt'],
 'max_depth': [3, 4, 5, 6, 7, 8]}
In [141]:
# First create the base model to tune
rf = RandomForestRegressor()
tscv = TimeSeriesSplit(n_splits=5)

# Random search of parameters
rfF = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, 
                               cv = tscv, verbose=2, random_state = 42, n_jobs = -1)

# Fit the random search model
rfF.fit(X_trainF, y_trainF)

rfF.best_params_
#rf.fit(X_train, y_train)
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  2.3min finished
Out[141]:
{'n_estimators': 157, 'max_features': 'auto', 'max_depth': 8}
In [142]:
rfFourier = RandomForestRegressor(n_estimators= rfF.best_params_['n_estimators'], 
                                  max_features=rfF.best_params_['max_features'], 
                                  max_depth= rfF.best_params_['max_depth'], random_state = 42)
rfFourier.fit(X_trainF, y_trainF)
Out[142]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=8,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=157,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)
In [143]:
# comparison of scores using elastic net and RF

print("Training accuracy R2 using random forest is {}".format(rfF.score(X_trainF, y_trainF)))
print("Training accuracy R2 using Elastic net is {}".format(elastic_net_opt_F.score(X_trainF, y_trainF)))
print("\nTest accuracy R2 using random forest is {}".format(rfF.score(X_testF, y_testF)))
print("Test accuracy R2 using Elastic net is {}".format(elastic_net_opt_F.score(X_testF, y_testF)))
Training accuracy R2 using random forest is 0.9255257850577151
Training accuracy R2 using Elastic net is 0.701042746237448

Test accuracy R2 using random forest is 0.8390618983715401
Test accuracy R2 using Elastic net is 0.6680217021374875
In [144]:
fig = go.Figure()

fig.add_trace(go.Scatter(x = X_testF.index, y = np.array(y_testF), name = "observed",
                         line_color = 'deepskyblue', opacity = 0.5))

fig.add_trace(go.Scatter(x = X_testF.index, y = rfF.predict(X_testF), name = "Random forest predictions",
                         line_color = 'lightslategrey', opacity = 0.5))

fig.add_trace(go.Scatter(x = X_testF.index, y = elastic_net_opt_F.predict(X_testF), name = "Elastic net predictions",
                         line_color = 'lightcoral', opacity = 0.5))

fig.update_layout(title_text = 'Observed test set vs Predicted energy using Random forest and elastic net reg on data with \
Fourier series',
                  xaxis_rangeslider_visible = True)
fig.show()
In [145]:
plot_predvstrue_reg(rfF.predict(X_testF), y_testF, model_name='Random forest with fourier terms test set')
In [146]:
# comparison of scores using elastic net and RF on the data with fourier terms

print('Random forest errors with fourier terms on training')
error_metrics(rfF.predict(X_trainF), y_trainF, model_name = 'Tuned Random forest with fourier terms', test=False)
print('\nRandom forest errors with fourier terms on test')
error_metrics(rfF.predict(X_testF), y_testF, model_name = 'Tuned Random forest with fourier terms', test=True)
print('\nElastic net errors with fourier terms on train')
error_metrics(elastic_net_opt_F.predict(X_trainF), y_trainF, model_name = 'Tuned Elastic net with fourier terms', test=False)
print('\nElastic net errors with fourier terms on test')
error_metrics(elastic_net_opt_F.predict(X_testF), y_testF, model_name = 'Tuned Elastic net with fourier terms', test=True)
Random forest errors with fourier terms on training

Error metrics for model Tuned Random forest with fourier terms
RMSE or Root mean squared error: 130.73
Variance score: 0.93
Mean Absolute Error: 92.15
Mean Absolute Percentage Error: 3.79 %

Random forest errors with fourier terms on test

Error metrics for model Tuned Random forest with fourier terms
RMSE or Root mean squared error: 197.65
Variance score: 0.84
Mean Absolute Error: 137.66
Mean Absolute Percentage Error: 5.75 %

Elastic net errors with fourier terms on train

Error metrics for model Tuned Elastic net with fourier terms
RMSE or Root mean squared error: 261.93
Variance score: 0.70
Mean Absolute Error: 193.98
Mean Absolute Percentage Error: 8.08 %

Elastic net errors with fourier terms on test

Error metrics for model Tuned Elastic net with fourier terms
RMSE or Root mean squared error: 283.87
Variance score: 0.67
Mean Absolute Error: 204.26
Mean Absolute Percentage Error: 8.39 %
In [147]:
# checking the feature importance for the random forest model

feat_imp = pd.DataFrame({'importance':rfFourier.feature_importances_})    
feat_imp['feature'] = X_trainF.columns
feat_imp.sort_values(by='importance', ascending=False, inplace=True)
#feat_imp = feat_imp.iloc[:top_n]
    
feat_imp.sort_values(by='importance', inplace=True)
feat_imp = feat_imp.set_index('feature', drop=True)
_ = feat_imp.plot.barh(title = 'Random Forest feature importance', figsize = (12,7))
In [148]:
# # checking the feature importance for the Elastic net model

feat_imp = pd.DataFrame({'importance':np.abs(elastic_net_opt_F.coef_)})    
feat_imp['feature'] = X_trainF.columns
feat_imp.sort_values(by='importance', ascending=False, inplace=True)
#feat_imp = feat_imp.iloc[:top_n]
    
feat_imp.sort_values(by='importance', inplace=True)
feat_imp = feat_imp.set_index('feature', drop=True)
_ = feat_imp.plot.barh(title = 'Elastic net feature importance', figsize = (12,7))

The feature importance for both the models show that the following features are the most important: temperature, hour, year and non_working_working (whether a day is working or not).


  • Random forest performs well with the Fourier terms added. It captures the multiple seasonalities well and also gives a MAPE of only 5.76% compared to the baseline MAPE of 9.23%.
  • Elastic net regression doesn't capture the higher energy consumption values as good as the random forest does.
  • Random forest's better performance gave way for trying the tree based XGBoost model on the data.


7.3 XGBoost

XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library.

XGBoost is well known to provide better solutions than other machine learning algorithms. It is not often used for time series, especially if the base used is trees because it is difficult to catch the trend with trees, but since our data doesn't have a very significant trend and also since it has multiple seasonalities and depends sinificantly on an exogenous variable like temperature, we can try XGboost to see how it performs on the time series data of energy consumption.

In [149]:
import xgboost as xgb
In [150]:
# convert the dataset into an optimized data structure called Dmatrix that XGBoost supports and 
# gives it acclaimed performance and efficiency gains
data_dmatrix = xgb.DMatrix(data = sdgecyc.drop('SDGE', axis=1), label = sdgecyc['SDGE']) # not used 
In [151]:
# generating the model
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 3, alpha = 5, n_estimators = 100, random_state=42)
In [152]:
# Fitting the model on the training set
xg_reg.fit(X_trainF, y_trainF)
[14:35:39] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[152]:
XGBRegressor(alpha=5, base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)
In [153]:
# Predicting on test set
preds_boost = xg_reg.predict(X_testF)
In [154]:
_ = error_metrics(preds_boost, y_testF, model_name='XGBoost with Fourier terms', test=True)
Error metrics for model XGBoost with Fourier terms
RMSE or Root mean squared error: 186.54
Variance score: 0.86
Mean Absolute Error: 136.65
Mean Absolute Percentage Error: 5.83 %
In [155]:
_ = error_metrics(xg_reg.predict(X_trainF), y_trainF, model_name='XGBoost with Fourier terms', test= False)
Error metrics for model XGBoost with Fourier terms
RMSE or Root mean squared error: 141.49
Variance score: 0.91
Mean Absolute Error: 101.45
Mean Absolute Percentage Error: 4.17 %
In [156]:
plot_ts_pred(y_trainF, xg_reg.predict(X_trainF), model_name='XGBoost with Fourier terms on Training set')
In [157]:
plot_ts_pred(y_testF, preds_boost, model_name='XGBoost with Fourier terms on Test set')
In [158]:
plot_predvstrue_reg(preds_boost,y_testF, model_name='XGBoost with Fourier terms on Test set')

Since, usually the grid operator needs to plan for the maximum demand of a day, it is good to check the error metrics on the max daily demand forecast of the model as well.

In [159]:
# converting the predictions into series so that we can use the .resample method on it
preds_boost_series = pd.Series(preds_boost, index = y_testF.index )
# Resampling both the y_test and predictions at a 24 hours period and using the max as the aggregate function
_ = error_metrics(preds_boost_series.resample('24h').max(), 
                  y_testF.resample('24h').max(), 
                  model_name='XGBoost with Fourier terms; daily MAX', test=True)
Error metrics for model XGBoost with Fourier terms; daily MAX
RMSE or Root mean squared error: 172.84
Variance score: 0.85
Mean Absolute Error: 126.23
Mean Absolute Percentage Error: 4.08 %
In [160]:
# ...on training set max
preds_train = pd.Series(xg_reg.predict(X_trainF), index = y_trainF.index )
# Resampling both the y_test and predictions at a 24 hours period and using the max as the aggregate function
_ = error_metrics(preds_train.resample('24h').max(), 
                  y_trainF.resample('24h').max(), 
                  model_name='XGBoost with Fourier terms; daily MAX', test=False)
Error metrics for model XGBoost with Fourier terms; daily MAX
RMSE or Root mean squared error: 149.13
Variance score: 0.85
Mean Absolute Error: 108.30
Mean Absolute Percentage Error: 3.52 %
In [161]:
# For plotting the tree
import os
os.environ["PATH"] += os.pathsep + 'C:/Users/ppawar/graphviz-2.38/release/bin'
In [178]:
#pd.plotting.register_matplotlib_converters()
xgb.plot_tree(xg_reg,num_trees=98) # Choosing the last few trees because the model optimizes sequentially
plt.rcParams['figure.figsize'] = [30, 40]
In [180]:
#pd.plotting.register_matplotlib_converters()
xgb.plot_importance(xg_reg)
plt.rcParams['figure.figsize'] = [15, 7]

Tuning the hyper parameters for the XGBoost model

In [164]:
random_grid
Out[164]:
{'n_estimators': [10, 31, 52, 73, 94, 115, 136, 157, 178, 200],
 'max_features': ['auto', 'sqrt'],
 'max_depth': [3, 4, 5, 6, 7, 8]}
In [165]:
# Tuning the XGBoost model
xgbtuned = xgb.XGBRegressor()
In [166]:
param_grid = {
        'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
        'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
        'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
        'gamma': [0, 0.25, 0.5, 1.0],
        'n_estimators': [10, 31, 52, 73, 94, 115, 136, 157, 178, 200]}

tscv = TimeSeriesSplit(n_splits=3)
xgbtunedreg = RandomizedSearchCV(xgbtuned, param_distributions=param_grid , 
                                   scoring='neg_mean_squared_error', n_iter=20, n_jobs=-1, 
                                   cv=tscv, verbose=2, random_state=42)

xgbtunedreg.fit(X_trainF, y_trainF)
best_score = xgbtunedreg.best_score_
best_params = xgbtunedreg.best_params_
print("Best score: {}".format(best_score))
print("Best params: ")
for param_name in sorted(best_params.keys()):
    print('%s: %r' % (param_name, best_params[param_name]))
Fitting 3 folds for each of 20 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   53.4s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  1.4min finished
[14:37:14] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Best score: -30999.88788122807
Best params: 
colsample_bylevel: 0.9
colsample_bytree: 0.6
gamma: 0
learning_rate: 0.1
max_depth: 6
min_child_weight: 5.0
n_estimators: 73
subsample: 0.9
In [167]:
preds_boost_tuned = xgbtunedreg.predict(X_testF)
In [168]:
_ = error_metrics(preds_boost_tuned, y_testF, model_name='Tuned XGBoost with Fourier terms', test=True)
Error metrics for model Tuned XGBoost with Fourier terms
RMSE or Root mean squared error: 172.21
Variance score: 0.88
Mean Absolute Error: 121.19
Mean Absolute Percentage Error: 5.08 %
In [169]:
_ = error_metrics(xgbtunedreg.predict(X_trainF), y_trainF, model_name='Tuned XGBoost with Fourier terms', test=False)
Error metrics for model Tuned XGBoost with Fourier terms
RMSE or Root mean squared error: 89.32
Variance score: 0.97
Mean Absolute Error: 63.22
Mean Absolute Percentage Error: 2.60 %
In [170]:
plot_ts_pred(y_testF, preds_boost_tuned, model_name='Tuned XGBoost with Fourier terms on test')
In [171]:
plot_predvstrue_reg(preds_boost_tuned,y_testF, model_name='Tuned XGBoost with Fourier terms on test set')
  • We can see how fast the XGBoost model is. The entire model was tuned and fit in around 2 minutes.
  • The scores didn't improve much by using the tuned parameters, in fact the overfitting increased. There is more scope for tuning using Gridsearch instead of Random search since not all combinations were tried, but we can go with the default parameters as done with our base model xg_reg as the performance didn't change much by tuning.

7.4. XGBoost + FB Prophet trend

As discussed earlier, to model a time series data it needs to be stationary. So, the ideal case would be to detrend the data and then feed it into the ML models and then add the trend to the forecasted results. Nonetheless good results were obtained above without detrending because the energy consumption data from 2014 to 2018 has a very weak trend and the multiple seasonalities were handled well by the Fourier terms.

Also, the effect of cum_AC_kW, which is the cumulative PV installation till date, can be modeled using FB Prophet and then merged with the XGBoost's forecast. Any tree based regression model like XGBoost cannot handle the X variables like cum_AC_kW because it is an ever increasing variable and the test data will always have higher magnitude values not seen by the model in the training set.

I have extracted the trend and cum_AC_kW impact on energy from the FB Prophet model and subtracted these two components from our main dataframe with all the fourier terms. Then this detrended energy consumption data was passed onto the XGBoost model and the XGBoost forecast results were added back to the total trend to get the final predictions.

In [172]:
_ = plt.figure(figsize = (8,5))
_ = (forecast.trend + forecast.cum_AC_kW).plot()
In [188]:
fxdata = sdgecyc.copy()
In [189]:
fxdata.drop(['cum_AC_kW', 'year'], axis = 1, inplace=True)
In [190]:
# Detrending the data
fxdata['SDGE'] = fxdata['SDGE'].to_numpy() - (forecast.trend + forecast.cum_AC_kW).to_numpy()
In [195]:
fxdata['SDGE'].isna().sum()
Out[195]:
0
In [197]:
pd.plotting.register_matplotlib_converters()
fxdata['SDGE'].plot()
Out[197]:
<matplotlib.axes._subplots.AxesSubplot at 0x243966eaeb8>
In [198]:
data1 = fxdata
cols_to_transform = ['HourlyDryBulbTemperature'] # technically we don't need to standardize data for tree based models

X_trainFP, X_testFP, y_trainFP, y_testFP = train_test(data=data1, cols_to_transform=cols_to_transform, 
                                                                scale = True, test_size = 0.15, include_test_scale=False)
In [199]:
# generating the model
xg_FP = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 5, n_estimators = 73, random_state=42)
In [200]:
# Fitting the model on the training set
xg_FP.fit(X_trainFP, y_trainFP)
[14:44:03] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[200]:
XGBRegressor(alpha=5, base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=5, min_child_weight=1, missing=None, n_estimators=73,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)
In [201]:
# Predicting on test set
preds_boostFP = xg_FP.predict(X_testFP)
In [202]:
# adding back the trend
trend_and_PV = forecast.trend + forecast.cum_AC_kW
In [203]:
boost_add_trend = preds_boostFP + trend_and_PV[-len(y_testF):]
In [204]:
plot_ts_pred(y_testF, boost_add_trend.to_numpy(), model_name='XGBoost with detrend, Fourier terms on test')
In [205]:
plot_predvstrue_reg(boost_add_trend.to_numpy(),y_testF, model_name='XGBoost with detrend Prophet, Fourier terms on test')
In [206]:
_ = error_metrics(boost_add_trend, y_testF, model_name='XGBoost with detrend Prophet, Fourier terms', test=True)
Error metrics for model XGBoost with detrend Prophet, Fourier terms
RMSE or Root mean squared error: 212.97
Variance score: 0.81
Mean Absolute Error: 170.77
Mean Absolute Percentage Error: 7.30 %
In [207]:
# on training set
preds_boostFP_train = xg_FP.predict(X_trainFP)
# adding back the trend
boost_add_trend_train = preds_boostFP_train + trend_and_PV[:(len(trend_and_PV) - len(y_testF))]
In [208]:
_ = error_metrics(boost_add_trend_train, y_trainF, model_name='XGBoost with detrend Prophet, Fourier terms', test=False)
Error metrics for model XGBoost with detrend Prophet, Fourier terms
RMSE or Root mean squared error: 129.81
Variance score: 0.93
Mean Absolute Error: 95.06
Mean Absolute Percentage Error: 3.95 %
  • The combined model is performing worse than the XGBoost alone because FB Prophet over estimates the trend and cum_AC_kW impact on the energy, especially at the end of the time scale, as can be seen from the plot above for forecast.trend + forecast.cum_AC_kW which shows a sudden dip at the end. This seems to result in under prediction of the energy consumption and impacts the overall results.

8. Conclusion


Different models were tried to forecast the energy consumption in MWH of the San Diego Gas and Electric (SDGE) utility region. The energy consumption is highly dependent on the outside temperature and has strong multiple seasonalities - daily, weekly and yearly. Also, the energy consumption has decreased slightly from 2014 to 2018. The increasing PV (photovoltaic) installations in the region (cum_AC_kW) seems to have brought the decreasing trend in the energy consumption since more renewable energy at customer's facility means less load on the utility. But there can be other factors causing this decreasing trend such as the energy storage installations at the customer facilities, increase in electric efficiencies of the household and commerical equipment, people becoming more conscious of their usage (morally or through utility incenitves), etc.

The best way to capture the trend, which is a combination of all the above factors and maybe more, is to make the model learn the trend over a long period of time (more than 3 years at least). The seasonality is an important part in predicting the energy consumption of a region, so getting that part right was also very crucial for improving the model's performance.

Different models were tried and here is a summary of the error metric for each model including the baseline model where today's energy consumption is equal to the last year's energy consumption at the same hour:

In [209]:
trydf = pd.DataFrame.from_dict(dict_error)
#trydf.sort_values('MAPE', ascending=True).groupby(['model', 'train_test']).mean()
sorted_errors = trydf.pivot_table(index = 'model', columns = 'train_test', 
                                  aggfunc='min').sort_values(('MAPE', 'test'), ascending=True)
In [210]:
table = (sorted_errors.sort_index(axis=1, level=1, ascending=False).sort_index(axis=1, level=[0], sort_remaining=False)).\
round(3)
table.style.highlight_min(['MAPE', 'MAE', 'RMSE'], 
                                  axis=0).highlight_max(['R2'], axis=0).highlight_null(null_color='gray')
Out[210]:
MAE MAPE R2 RMSE
train_test train test train test train test train test
model
Elastic net with all lags 36.36 37.358 1.511 1.579 0.988 0.989 51.586 50.749
Random forest with all lags 60.526 63.703 2.56 2.733 0.972 0.971 79.515 83.823
XGBoost with Fourier terms; daily MAX 108.3 126.228 3.517 4.077 0.855 0.846 149.125 172.841
Tuned XGBoost with Fourier terms 63.223 121.192 2.599 5.082 0.965 0.878 89.317 172.214
SARIMAX(2,1,1)x(1,0,1,24) with Fourier terms 1 week 317.134 103.459 16.736 5.326 -0.916 0.676 363.925 150.457
Tuned Random forest with fourier terms 92.155 137.664 3.786 5.746 0.926 0.839 130.733 197.646
XGBoost with Fourier terms 101.453 136.651 4.168 5.825 0.913 0.857 141.492 186.538
XGBoost with detrend Prophet, Fourier terms 95.059 170.77 3.948 7.298 0.927 0.813 129.81 212.972
Ridge regression with scaling 176.284 194.251 7.341 8.182 0.744 0.716 242.498 262.806
Simple linear regression with scaling 170.175 194.836 7.128 8.367 0.769 0.733 230.146 254.839
Tuned Elastic net with fourier terms 193.976 204.258 8.084 8.387 0.701 0.668 261.93 283.867
FB Prophet with auto seasonality nan 201.058 nan 8.551 nan 0.716 nan 262.74
Baseline Persistent forecast, repeat last year; daily MAX nan 264.435 nan 8.599 nan 0.22 nan 389.371
Baseline Persistent forecast, repeat last year nan 224.889 nan 9.231 nan 0.549 nan 330.737
Tuned Random forest with reduced hour space 193.409 223.084 8.205 9.585 0.724 0.658 251.793 288.068
Tuned elastic net regression with reduced hour space 212.492 236.888 8.996 9.977 0.663 0.599 277.959 311.883
FB Prophet with auto seasonality 1 week ahead 164.757 nan 7.663 nan 0.762 nan 203.214 nan
In [211]:
_  = sorted_errors['MAPE'].plot()
_ = plt.xticks(ticks = range(0,len(sorted_errors)), labels = sorted_errors.index.values, rotation=90)
_ = plt.figure(figsize=(8,6))
<Figure size 576x432 with 0 Axes>

Note: NaN values are for the models which couldn't be tested on a training set.

  • Based on the MAPE and RMSE scores, the XGBoost model with the Fourier terms has performed very well, predicting for a forecasting window of 8 months ahead. For an hourly data with multiple seasonalities that is a pretty impressive result.

  • For long term forecasts- most of the models have performed better than the baseline persistence model and the best model (XGBoost) gives a MAPE of 5.08% compared to the baseline error of 9.23% on the test set. The RMSE, R2 and MAE values are also considerably lower than the baseline model. The difference in RMSE with the baseline model is almost 160 MW which is pretty significant.

  • FB Prophet does a very good job in identifying the trend and seasonalities of the data. It can be paired with XGBoost and a more robust long term forecast can be obtained. The trend and cum_AC_kW regression coefficient information was extracted from trained FB Prohet model and it was used it to detrend the training set for XGBoost model. The XG Boost predictions were then combined with the predictions from both the models to give an aggregate forecast. Unfortunately, this didn't work very well as the forecast at the end of our time series seems to drop down at a faster rate than the observed.

  • While comparing the models, I am not considering the elastic net and random forest regression with all lags because they are only good for hour ahead forecasts. Even SARIMA does comparably well for short term hour-ahead or week ahead forecasts. So, pretty much any of the model given above using lag variables should be good for short term forecasts (95-98% R2 and 1-3% MAPE). Elastic net should be used for short term forecasts, given it had the highest accuracy and also the model training time is very less compared to SARIMAX.

  • SARIMA performs terribly bad on long term forecasts and doesn't capture the multiple seasonalities well. Maybe more feature engineering can be done to help SARIMA identify multiple seasonalities but given how time consuming the model training for SARIMA is, it is better to focus the resources on other models.

9. Future work

  • Try more methods where time series models and other traditional ML models could be clubbed together for better performance. Also, for the FB prophet and XGBoost combination, need to check if trying different combinations of variable distribution among the two models gives better results. i.e. prediciting the impact of $X_n$ variables using 1 model and that of $N-X_n$ variables using other model where $N$ is the total numbe of $X$ variables being considered.
  • Add std. error regions to the predictions.
  • Bring in the new data of 2019 year and use it as a fresh test set.
  • Try LSTM and SVR(linear & rbf both)
  • Bring in the battery (energy storage) installation and Electric vehicle (EV) ownership data. A combination of solar panels and batteries usually result in lower dependency of a house or a building on the grid because their combination makes them more self-sufficient. And electric vehicles can add load at different times to the grid, so, battery installation and EV data clubbed with the PV data can help us improve the energy consumption prediction.
  • Try some feature engineering by multiplying the PV and energy storage cumulative capacities with the cyclical hour data since their impact on energy consumption of a facility is highly time dependent. For example, solar panels produce electricity only during the day, so we can help the model understand that the PV impact should be accounted for only during the sunlight hours.